> On May 17, 2026, at 11:57 AM, Theodore Tso <[email protected]> wrote:
> On Sun, May 17, 2026 at 11:17:06AM -0700, Roman Gushchin wrote:
>> 
>> I actually tried to run it with ollama on my
>> personal Framework 13. Adding nominal support is trivial, but the
>> whole thing is not really useful: I can get maybe a few hundred
>> tokens per second using a quantized model with reduced quality; an
>> average sashiko review consumes 3.5 million tokens (with Gemini
>> 3.1 Pro; it’s also model-dependent).
> 
> I'm curious.  What hardware and LLM model were you using?  A few
> hundred tokens per second seems surprisingly high.  My initial
> research[1] shows that an M5 Max MacBook Pro costing 5 or 6 kilobucks
> can do 31.6 tokens/second on a 27B 4-bit quantized model (Qwen 3.5).

I have a Framework 13 with an AMD 7840U. I tried several models, both on
CPU and GPU. Sorry, it was a couple of months ago and I don’t remember
all the details, so I won’t claim any specific numbers, but as I
remember, the best numbers were around a hundred tokens per second. In
any case, it’s a few orders of magnitude slower than what is
realistically required.
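
For a rough sense of scale, here is a back-of-the-envelope sketch in
Python using only the figures quoted in this thread. It assumes, as a
simplification, that all 3.5 million tokens of a review pass through the
model at one flat rate; in practice prompt processing and generation run
at different speeds, so treat the output as an order-of-magnitude
estimate, not a benchmark. The names `TOKENS_PER_REVIEW`, `RATES` and
`review_hours` are made up for this illustration.

```python
# Average tokens consumed by one sashiko review, as quoted in the thread.
TOKENS_PER_REVIEW = 3_500_000

# Throughput figures mentioned in this thread, in tokens per second.
# The first entry is a recalled, approximate number, not a measurement.
RATES = {
    "Framework 13, 7840U (recalled best case)": 100.0,
    "Qwen 3.5 27B, 4-bit (M5 Max)": 31.6,
    "Gemma 3 27B, 6-bit": 21.0,
    "Deepseek R1 8B, Q6_K": 72.8,
}


def review_hours(tokens_per_second, tokens=TOKENS_PER_REVIEW):
    """Hours needed to process `tokens` at a flat rate (simplifying assumption)."""
    return tokens / tokens_per_second / 3600.0


for name, rate in RATES.items():
    print(f"{name}: {review_hours(rate):.1f} hours per review")
```

Even at the most optimistic rate here, a single review would take several
hours under this assumption.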

If someone has powerful hardware and is willing to benchmark sashiko
with open-source models, I’m very interested in the results.

> [1] https://www.reddit.com/r/LocalLLaMA/comments/1rzkw4x/m5_max_128g_performance_tests_i_just_got_my_new/
> 
> The model matters of course.  With Gemma 3 27B and a 6-bit
> quantization, it's 21 tokens/s, and with Deepseek R1 8B Q6_K, it's
> 72.8 tokens/second.  But unless you're using a really low-end model,
> or a really expensive, splufty hardware platform, I haven't seen
> reports of hundreds of tokens per second on hardware costing a
> reasonable amount of money.  (I'll set aside the question of whether
> spending $6k for a fully spec'ed out M5 Max MacBook Pro, or $15k for a
> fully spec'ed out M3 Ultra Mac Studio is "reasonable".)
> 
> As a result, I'm not entirely sure how realistic it is to do reviews
> using "free" (you still have to pay $$$ for the hardware) local,
> open-weight LLMs if an average review requires around 3.5 million
> tokens.

Fully agree. But it might change in a few years; things are moving quickly.
