branch: externals/minuet
commit 0cea13a0da31d21ce1bdb1a6415bc6baa5428aa2
Author: Milan Glacier <d...@milanglacier.com>
Commit: Milan Glacier <d...@milanglacier.com>

    doc: add section `Understanding Model Speed`.
---
 README.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/README.md b/README.md
index 0723c457db..6fe1289050 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,7 @@
   - [Llama.cpp Qwen-2.5-coder:1.5b](#llamacpp-qwen-25-coder15b)
 - [API Keys](#api-keys)
 - [Selecting a Provider or Model](#selecting-a-provider-or-model)
+  - [Understanding Model Speed](#understanding-model-speed)
 - [Prompt](#prompt)
 - [Configuration](#configuration)
   - [minuet-provider](#minuet-provider)
@@ -285,6 +286,28 @@ significantly slow down the default provider used by Minuet
 (`openai-fim-compatible` with deepseek). We recommend trying alternative
 providers instead.
 
+## Understanding Model Speed
+
+For cloud-based providers,
+[OpenRouter](https://openrouter.ai/google/gemini-2.0-flash-001/providers)
+offers a valuable resource for comparing the speed of both closed-source and
+open-source models hosted by various cloud inference providers.
+
+When assessing model speed on OpenRouter, the two key metrics are latency
+(time to first token) and throughput (tokens per second). Latency is often the
+more critical factor, since it determines how quickly a suggestion starts to
+appear.
+
+Ideally, one would aim for a latency of less than 1 second and a throughput
+exceeding 100 tokens per second.
+
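+To sanity-check these numbers yourself, the rough, self-contained sketch below
+(not part of minuet; the endpoint URL, API key, model name, and the
+whitespace-based token count are placeholder assumptions) streams one chat
+completion from an OpenAI-compatible endpoint and reports both metrics.
+
+```python
+# Rough sketch: measure time-to-first-token (latency) and tokens/second
+# (throughput) from an OpenAI-compatible streaming chat endpoint.
+# URL, API key, and model name are placeholders; counting tokens by
+# whitespace is only an approximation.
+import json
+import time
+
+import requests
+
+URL = "https://openrouter.ai/api/v1/chat/completions"
+HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
+PAYLOAD = {
+    "model": "google/gemini-2.0-flash-001",
+    "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
+    "stream": True,
+}
+
+start = time.time()
+first_token_time = None
+approx_tokens = 0
+
+with requests.post(URL, headers=HEADERS, json=PAYLOAD, stream=True) as resp:
+    for raw in resp.iter_lines():
+        # Server-sent events: payload lines are prefixed with "data: ".
+        if not raw or not raw.startswith(b"data: "):
+            continue
+        chunk = raw[len(b"data: "):]
+        if chunk == b"[DONE]":
+            break
+        delta = json.loads(chunk)["choices"][0].get("delta", {}).get("content") or ""
+        if delta and first_token_time is None:
+            first_token_time = time.time()
+        approx_tokens += len(delta.split())
+
+if first_token_time is not None:
+    latency = first_token_time - start
+    throughput = approx_tokens / max(time.time() - first_token_time, 1e-6)
+    print(f"latency: {latency:.2f}s, throughput: ~{throughput:.0f} tokens/s")
+```
+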
+For local LLMs,
+[llama.cpp#4167](https://github.com/ggml-org/llama.cpp/discussions/4167)
+provides valuable data on the speed of 7B models running on Apple M-series
+chips. The two crucial metrics are `Q4_0 PP [t/s]`, the prompt-processing
+speed (tokens per second spent filling the KV cache, which determines the time
+to first token), and `Q4_0 TG [t/s]`, the token-generation speed (tokens per
+second produced once the prompt has been processed).
+
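+As a back-of-envelope illustration of why `PP` dominates the perceived latency
+of a completion request, the short sketch below estimates the time to first
+token and the total completion time; all token counts and speeds in it are
+made-up placeholders, not figures from the discussion linked above.
+
+```python
+# Estimate completion latency from llama.cpp-style benchmark numbers.
+# All figures are made-up placeholders, not values from llama.cpp#4167.
+prompt_tokens = 2000      # context sent with each completion request
+completion_tokens = 64    # length of the requested completion
+pp_speed = 500.0          # Q4_0 PP [t/s]: prompt-processing speed
+tg_speed = 40.0           # Q4_0 TG [t/s]: token-generation speed
+
+time_to_first_token = prompt_tokens / pp_speed   # 4.0 s before anything appears
+generation_time = completion_tokens / tg_speed   # 1.6 s to finish the suggestion
+print(f"time to first token: {time_to_first_token:.1f}s, "
+      f"total: {time_to_first_token + generation_time:.1f}s")
+```
+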
 # Prompt
 
 See [prompt](./prompt.md) for the default prompt used by `minuet` and
