Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++

2024-03-10 Thread Petter Reinholdtsen
[Christian Kastner]
> I'm open to better ideas, though.

I find in general that programs written with run-time selection of
optimizations are far superior to per-host compilations, at least from
a system administration viewpoint.  I guess such an approach would
require rewriting llama.cpp, and I have no idea how much work that
would be.
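
To illustrate the idea, here is a minimal, llama.cpp-independent sketch
of run-time dispatch on x86 using the GCC/clang CPU builtins (the
kernel names are invented for the example):

  // Minimal sketch of run-time dispatch between optimized code paths.
  // x86-only; uses the GCC/clang target attribute and CPU builtins.
  // The "kernels" are placeholders, not actual llama.cpp code.
  #include <cstdio>

  static void matmul_generic() { std::puts("generic kernel"); }

  __attribute__((target("avx2")))
  static void matmul_avx2()    { std::puts("AVX2 kernel"); }

  __attribute__((target("avx512f")))
  static void matmul_avx512()  { std::puts("AVX-512 kernel"); }

  using kernel_fn = void (*)();

  // Pick the best implementation the current CPU supports, once.
  static kernel_fn select_matmul()
  {
      __builtin_cpu_init();
      if (__builtin_cpu_supports("avx512f"))
          return matmul_avx512;
      if (__builtin_cpu_supports("avx2"))
          return matmul_avx2;
      return matmul_generic;
  }

  int main()
  {
      kernel_fn matmul = select_matmul();
      matmul();  // never executes instructions this CPU lacks
  }

A single binary built that way runs on a baseline amd64 machine while
still using AVX2/AVX-512 where the CPU provides them.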

I look forward to having a look at your git repo to see if there is
something there I can learn from for the whisper.cpp packaging.

-- 
Happy hacking
Petter Reinholdtsen



Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++

2024-03-09 Thread Christian Kastner
Hey Petter,

On 2024-03-08 20:21, Petter Reinholdtsen wrote:
> [Christian Kastner 2024-02-13]
>> I'll push a first draft soon, though it will definitely not be
>> upload-ready for the above reasons.
> 
> Where can I find the first draft?

I've discarded the simple package and now plan another approach: a
package that ships a helper to rebuild the utility on the target host
when needed, similar to DKMS. Rationale:
  * Upstream development is continuous, so no single build is suited
for stable
  * A build optimized for the current host's hardware is a key
feature; building for our amd64 ISA baseline would be absurd.
I'm open to better ideas, though.
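
To make the "rebuild when needed" part concrete, the helper's check
could be as simple as comparing the CPU features recorded by the
previous build against the current host.  A rough sketch (the record
path and its format are invented for illustration):

  // Rough sketch of a "rebuild needed?" check: compare the CPU
  // features recorded by the previous build with the current host.
  // The record path and format are made up for this example.
  #include <cstdio>
  #include <fstream>
  #include <string>

  static std::string current_features()
  {
      __builtin_cpu_init();
      std::string s;
      if (__builtin_cpu_supports("avx512f")) s += "avx512f ";
      if (__builtin_cpu_supports("avx2"))    s += "avx2 ";
      if (__builtin_cpu_supports("fma"))     s += "fma ";
      if (__builtin_cpu_supports("f16c"))    s += "f16c ";
      return s;
  }

  int main()
  {
      // Hypothetical record written by the previous local build.
      std::ifstream rec("/var/lib/llama.cpp/build-features");
      std::string recorded;
      std::getline(rec, recorded);

      if (!rec || recorded != current_features()) {
          std::puts("no local build yet, or CPU changed: rebuild");
          return 1;  // caller (e.g. a maintainer script) rebuilds
      }
      std::puts("existing local build still matches this host");
      return 0;
  }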

I had to pause this, primarily because of ROCm infrastructure work and
our updates to ROCm 5.7 in preparation for the gfx1100, gfx1101, and
gfx1102 architectures, and that is still my focus.

Incidentally, we could use some help with that; see the thread at [1].
MIOpen [2] in particular is something that our ROCm stack will
eventually need.

Best,
Christian

[1] https://lists.debian.org/debian-ai/2024/03/msg00029.html
[2] https://github.com/ROCm/MIOpen



Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++

2024-03-08 Thread Petter Reinholdtsen
[Christian Kastner 2024-02-13]
> I'll push a first draft soon, though it will definitely not be
> upload-ready for the above reasons.

Where can I find the first draft?
-- 
Happy hacking
Petter Reinholdtsen



Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++

2024-02-13 Thread Christian Kastner
Hi Petter,

On 2024-02-13 08:36, Petter Reinholdtsen wrote:
> I tried building the CPU edition on one machine and running it on
> another, and experienced illegal instruction exceptions.  I suspect
> this means one needs to be careful when selecting a build profile to
> ensure it works on all supported Debian platforms.

Yeah, that was my conclusion from my first experiments as well.

This is a problem, though, since one key point of llama.cpp is to make
the best use of the current hardware. If we targeted some 15-year-old
amd64 lowest common denominator, we'd go against that.

In my first experiments, I also had problems with ROCm builds on
hosts without a GPU.

I have yet to investigate if/how all capabilities can be enabled
generally, with their use determined at runtime.
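
One mechanism that might fit here is GCC/clang function
multi-versioning (target_clones): the compiler emits one clone per
listed ISA plus a default, and the best clone is selected automatically
at startup via an ifunc resolver.  A minimal sketch, independent of
llama.cpp:

  // Minimal sketch of function multi-versioning on x86: one clone is
  // emitted per listed target plus a "default" fallback, and the best
  // one is picked at startup.  The dot product is just a stand-in.
  #include <cstdio>

  __attribute__((target_clones("avx512f", "avx2", "default")))
  float dot(const float *a, const float *b, int n)
  {
      float acc = 0.0f;
      for (int i = 0; i < n; ++i)
          acc += a[i] * b[i];
      return acc;
  }

  int main()
  {
      const float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
      std::printf("dot = %f\n", dot(a, b, 4));
  }

That would keep a single binary per architecture while still using the
newer instructions where the CPU has them.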

Another issue is that stable is clearly the wrong distribution for
this. This is a project that is continuously gaining new features, so
we'd need to go through stable-updates.

> I would be happy to help getting this up and running.  Please let me
> know when you have published a git repo with the packaging rules.

I'll push a first draft soon, though it will definitely not be
upload-ready for the above reasons.

Best,
Christian



Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++

2024-02-12 Thread Petter Reinholdtsen


I tried building the CPU edition on one machine and running it on
another, and experienced illegal instruction exceptions.  I suspect
this means one needs to be careful when selecting a build profile to
ensure it works on all supported Debian platforms.

I would be happy to help getting this up and running.  Please let me
know when you have published a git repo with the packaging rules.

-- 
Happy hacking
Petter Reinholdtsen



Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++

2024-02-10 Thread Christian Kastner
Package: wnpp
Severity: wishlist
Owner: Christian Kastner 
X-Debbugs-Cc: debian-de...@lists.debian.org, debian...@lists.debian.org

* Package name: llama.cpp
  Version : b2116
  Upstream Author : Georgi Gerganov
* URL : https://github.com/ggerganov/llama.cpp
* License : MIT
  Programming Lang: C++
  Description : Inference of Meta's LLaMA model (and others) in pure C/C++

The main goal of llama.cpp is to enable LLM inference with minimal
setup and state-of-the-art performance on a wide variety of hardware -
locally and in the cloud.

* Plain C/C++ implementation without any dependencies
* Apple silicon is a first-class citizen - optimized via ARM NEON,
  Accelerate and Metal frameworks
* AVX, AVX2 and AVX512 support for x86 architectures
* 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for
  faster inference and reduced memory use
* Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD
  GPUs via HIP)
* Vulkan, SYCL, and (partial) OpenCL backend support
* CPU+GPU hybrid inference to partially accelerate models larger than
  the total VRAM capacity

This package will be maintained by the Debian Deep Learning Team.