branch: externals/minuet
commit 0cea13a0da31d21ce1bdb1a6415bc6baa5428aa2
Author: Milan Glacier <d...@milanglacier.com>
Commit: Milan Glacier <d...@milanglacier.com>
doc: add section `Understanding Model Speed`.
---
 README.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/README.md b/README.md
index 0723c457db..6fe1289050 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,7 @@
   - [Llama.cpp Qwen-2.5-coder:1.5b](#llamacpp-qwen-25-coder15b)
 - [API Keys](#api-keys)
 - [Selecting a Provider or Model](#selecting-a-provider-or-model)
+  - [Understanding Model Speed](#understanding-model-speed)
 - [Prompt](#prompt)
 - [Configuration](#configuration)
   - [minuet-provider](#minuet-provider)
@@ -285,6 +286,28 @@ significantly slow down the default provider used by Minuet
 (`openai-fim-compatible` with deepseek). We recommend trying alternative
 providers instead.
 
+## Understanding Model Speed
+
+For cloud-based providers,
+[Openrouter](https://openrouter.ai/google/gemini-2.0-flash-001/providers)
+offers a valuable resource for comparing the speed of both closed-source and
+open-source models hosted by various cloud inference providers.
+
+When assessing model speed via Openrouter, two key metrics are latency (time to
+first token) and throughput (tokens per second). Latency is often a more
+critical factor than throughput.
+
+Ideally, one would aim for a latency of less than 1 second and a throughput
+exceeding 100 tokens per second.
+
+For local LLMs,
+[llama.cpp#4167](https://github.com/ggml-org/llama.cpp/discussions/4167)
+provides valuable data on model speed for 7B models running on Apple M-series
+chips. The two crucial metrics are `Q4_0 PP [t/s]`, which measures latency
+(tokens per second to process the KV cache, equivalent to the time to generate
+the first token), and `Q4_0 TG [t/s]`, which indicates the tokens per second
+generation speed.
+
 # Prompt
 
 See [prompt](./prompt.md) for the default prompt used by `minuet` and
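As a rough illustration of why the added section treats latency (time to first token, or prompt-processing speed locally) as more critical than raw throughput, the sketch below estimates completion latency from the two llama.cpp metrics. The function name, the token counts, and the PP/TG figures are hypothetical example values chosen for the arithmetic; substitute the numbers reported for your own hardware or provider.

```python
# Back-of-the-envelope latency estimate for a single completion request.
# All numbers below are hypothetical examples, not benchmark results.

def estimate_latency(prompt_tokens: int, output_tokens: int,
                     pp_tokens_per_s: float, tg_tokens_per_s: float) -> tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds.

    pp_tokens_per_s: prompt-processing speed (the `Q4_0 PP [t/s]` column).
    tg_tokens_per_s: token-generation speed (the `Q4_0 TG [t/s]` column).
    """
    ttft = prompt_tokens / pp_tokens_per_s          # whole prompt is processed before the first token
    total = ttft + output_tokens / tg_tokens_per_s  # remaining tokens arrive at the generation rate
    return ttft, total


if __name__ == "__main__":
    # Hypothetical example: a 2,000-token completion context, a 64-token
    # suggestion, and a 7B model doing ~500 t/s prompt processing (PP)
    # and ~30 t/s generation (TG).
    ttft, total = estimate_latency(2000, 64, 500.0, 30.0)
    print(f"time to first token: {ttft:.2f}s, total: {total:.2f}s")
    # -> time to first token: 4.00s, total: 6.13s
```

With a long completion context and a short suggestion, most of the wait comes from prompt processing, which is why a sub-second time to first token (or a high PP rate locally) tends to matter more than generation throughput for completion-style requests.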