On Sun, May 17, 2026 at 11:17:06AM -0700, Roman Gushchin wrote:
>
> I actually tried to run it with ollama on my
> personal framework 13. Adding nominal support is trivial, but the
> whole thing is not really useful: I can get maybe few hundreds
> tokens per second using a quantified model with reduced quality; an
> average sashiko review is consuming 3.5 millions tokens (with Gemini
> 3.1 pro, it’s also model-dependent).

I'm curious. What hardware and LLM model were you using? A few
hundred tokens per second seems surprisingly high. My initial
research[1] shows that an M5 Max Macbook Pro costing 5 or 6 kilobucks
can do 31.6 tokens/second on a 27B 4-bit quantized model (Qwen 3.5).

[1] https://www.reddit.com/r/LocalLLaMA/comments/1rzkw4x/m5_max_128g_performance_tests_i_just_got_my_new/

The model matters of course. With Gemma 3 27B and a 6-bit
quantization, it's 21 tokens/s, and with Deepseek R1 8B Q6_K, it's
72.8 tokens/second. But unless you're using a really low-end model,
or a really expensive, splufty hardware platform, I haven't seen
reports of hundreds of tokens per second on hardware costing a
reasonable amount of money. (I'll set aside the question of whether
spending $6k for a fully spec'ed out M5 Max Macbook Pro, or $15k for
a fully spec'ed out M3 Ultra Mac Studio is "reasonable".)

As a result I'm not entirely sure how realistic it is to do reviews
using "free" (you still have to pay $$$ for the hardware) local,
open-weight LLMs if an average review requires around 3.5 million
tokens.

Cheers,

					- Ted
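
For a rough sense of scale, the figures quoted in this message can be
turned into wall-clock time per review. This is only a
back-of-the-envelope sketch: it assumes all 3.5 million tokens are
processed at the quoted generation rate (prompt processing is usually
faster, so treat the result as a pessimistic bound), and it assumes a
local model would consume the same token count as the Gemini run,
which the thread says is model-dependent. The helper name
hours_per_review is made up for illustration.

```python
# Back-of-the-envelope: hours needed for one review's worth of tokens.
# Assumption: every token is processed at the quoted generation rate.
TOKENS_PER_REVIEW = 3_500_000  # average figure quoted in the thread

# Generation rates (tokens/second) cited above for an M5 Max Macbook Pro.
RATES = {
    "Qwen 3.5 27B, 4-bit": 31.6,
    "Gemma 3 27B, 6-bit": 21.0,
    "Deepseek R1 8B Q6_K": 72.8,
}


def hours_per_review(tokens_per_second, tokens=TOKENS_PER_REVIEW):
    """Hypothetical helper: hours to process `tokens` at a fixed rate."""
    return tokens / tokens_per_second / 3600.0


for name, rate in RATES.items():
    print(f"{name}: {hours_per_review(rate):.1f} hours per review")
```

Under those assumptions this works out to roughly 31 hours per review
for the Qwen configuration, 46 hours for Gemma, and 13 hours for the
8B Deepseek model.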

