> I repeated the measurements using perf stat on multiple isolated
> cores, including runs after reboot and on different days. Increasing
> the number of iterations from -r 10 to -r 100 did not change the outcome.

Thanks, that's good to know.

> In the generated code for SciMark2, the compiler selects almost
> exclusively LMUL=M1 (only two MF2 occurrences in the whole assembly),
> so LMUL scaling itself is effectively a no-op here. Therefore, my assumption
> is that the difference in performance is caused by the base M1 latencies.
>
> In the previous MD model, the measured load latency did not follow a
> power-of-two relationship across LMULs (M1=3, M2=4, M4=8, M8=16).
> To make this compatible with the dynamic -madjust-lmul-cost scaling,
> I normalized M1 to 2 so higher LMULs could be approximated as ×2, ×4,
> etc. Otherwise this would result in 3/6/12/24 for M1/M2/M4/M8, which
> deviates significantly more from the measured 4/8/16 at higher LMULs
> than adjusting M1 from 3 to 2. This improves the fit for M2/M4/M8, but
> likely reduces accuracy for the dominant M1 case.
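
Just to spell out the trade-off with the numbers from your message, so I'm
sure I'm following (illustrative only):

            M1  M2  M4  M8
  measured   3   4   8  16
  3 x LMUL   3   6  12  24
  2 x LMUL   2   4   8  16

So the normalized variant is exact for M2/M4/M8 but one cycle short for the
M1 case that dominates SciMark2.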

Hmm, so we have both cases then:  One where modelling the latency exactly
helps (SciMark), and one where modelling it exactly is significantly worse
(the other two) :)

Insn scheduling is always a heuristic.  The way this is usually approached
is to benchmark a large number of tests/applications and check which setting
performs best overall.  Without including SPEC and others we might be in
overfitting territory.

I'm not 100% sure how to continue here.  On the one hand, I'd like to avoid
too much manual twiddling for the sole purpose of getting the LMUL latency
right.
There's also still the issue of VLS modes with -mrvv-vector-bits=zvl.
Those would all get assigned LMUL1 latencies right now, while the hook would 
use the proper scaling.
So for a "proper" solution there's still more work to be done:
Including LMUL scaling in the cost model, having a broader test base, maybe
adding a custom, per-uarch LMUL-scale curve/factor, etc.
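
To make the last point a bit more concrete: I'd imagine the per-uarch part
as a small table of multipliers the latency adjustment consults instead of
a hard-coded x2/x4/x8.  Rough sketch only, none of these names exist in the
backend:

  /* Hypothetical per-uarch LMUL latency curve, multipliers in percent
     for LMUL = 1, 2, 4, 8.  */
  struct lmul_scale
  {
    int percent[4];
  };

  /* Example curve reproducing the measured 3/4/8/16 load latencies
     from an M1 base latency of 3.  */
  static const struct lmul_scale example_uarch_scale = { { 100, 133, 267, 533 } };

  static int
  scaled_latency (int m1_latency, int lmul, const struct lmul_scale *s)
  {
    int idx = lmul == 1 ? 0 : lmul == 2 ? 1 : lmul == 4 ? 2 : 3;
    /* Round to the nearest cycle.  */
    return (m1_latency * s->percent[idx] + 50) / 100;
  }

That way a uarch whose latencies don't follow a strict power of two (like
the 3/4/8/16 above) wouldn't have to fudge its M1 latency just to make the
scaling come out right.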

On the other hand, your first patch clearly shows an improvement, and, even if 
not optimal, would improve the status quo.  It's unlikely we have time for the 
full solution, so maybe we should settle for a partial one for now?

Other opinions?

-- 
Regards
 Robin
