On Sat, 2026-03-21 at 08:49 +0100, Christian Kastner wrote:
> There are still two downsides:
> 
>   * libggml0-backend-all would not be allowed to depend on
>     libggml0-backend-cuda, which I expect to be the most popular
>     backend. So this could be a bit counterintuitive (though xorg
>     has a similar problem, and people seem to be able to deal with it).
> 
>   * I still need to test how cooperative the Vulkan backend is with
>     the vendor-specific ones. These currently conflict because they
>     require the same hardware resource, but from what Mathieu told me, I
>     may have been overly cautious, as the backend selection mechanism
>     seems to pick the vendor-specific one over Vulkan, rather than
>     using both.

Indeed, I have never noticed conflicts between the Vulkan and CUDA
backends. I tend to use one or the other in general, though: the Vulkan
backend is more portable, has a smaller footprint, and is easier to
deploy, while the CUDA backend is more mature.

But I would like to add a third downside:

  * The GPU backends are much more dependent on the deployment context
    than the CPU backend(s), in terms of configuration and
    model/quantization choice. The most common problem is the GPU VRAM
    size: not only must the model fit, but also each session's context.
    Otherwise performance can be very poor, or it simply won't work.
    One can then tinker with lower quantizations, MoE models,
    quantizing the KV cache, etc. But while llama.cpp has made a lot
    of progress on error feedback for such issues and on auto-tuning
    of the parameters, the GPU backends will very often not work out
    of the box, where the robust CPU backend would have been slow but
    working.
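To illustrate the VRAM point, here is a rough back-of-envelope sketch
of the KV-cache cost that comes on top of the model weights. The model
parameters below are assumed Llama-3-8B-like values, not read from any
actual GGUF file:

```python
# Illustrative KV-cache size estimate. All model parameters are
# assumptions (roughly Llama-3-8B-shaped), not taken from a real file.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                   bytes_per_elem=2):
    # Factor 2 for the separate K and V tensors; an f16 cache uses
    # 2 bytes per element (a quantized cache would use less).
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      ctx_len=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

So even a model that fits comfortably in VRAM can fail once a long
context (or several sessions) is added on top, which is exactly the
out-of-the-box failure mode described above.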

Cheers,

Mathieu
