> I repeated the measurements using perf stat on multiple isolated
> cores, including runs after reboot and on different days. Increasing
> the number of iterations from -r 10 to -r 100 did not change the outcome.

Thanks, that's good to know.

> In the generated code for SciMark2, the compiler selects almost
> exclusively LMUL=M1 (only two MF2 occurrences in the whole assembly),
> so LMUL scaling itself is effectively a no-op here. Therefore, my assumption
> is that the difference in performance is caused by the base M1 latencies.
>
> In the previous MD model, the measured load latency did not follow a
> power-of-two relationship across LMULs (M1=3, M2=4, M4=8, M8=16).
> To make this compatible with the dynamic -madjust-lmul-cost scaling,
> I normalized M1 to 2 so higher LMULs could be approximated as ×2, ×4,
> etc. Otherwise this would result in 3/6/12/24 for M1/M2/M4/M8, which
> deviates significantly more from the measured 4/8/16 at higher LMULs
> than adjusting M1 from 3 to 2. This improves the fit for M2/M4/M8, but
> likely reduces accuracy for the dominant M1 case.

Hmm, so we have both cases then: one where modelling the latency
exactly helps (SciMark), and one where modelling it exactly is
significantly worse (the two others) :)

Insn scheduling is always a heuristic. The usual approach is to
benchmark a large number of tests/applications and check which setting
performs best overall. Without including SPEC and others we might be
in overfitting territory.

I'm not 100% sure how to continue here. On one hand, I'd like to avoid
too much manual twiddling for the sole purpose of getting LMUL latency
right. There's also still the issue of VLS modes with
-mrvv-vector-bits=zvl. Those would all get assigned LMUL1 latencies
right now, while the hook would use the proper scaling. So for a
"proper" solution there's still more work to be done: including
lmul-scaling in the cost model, having a broader test base, maybe
adding a custom, per-uarch lmul-scale curve/factor, etc.

On the other hand, your first patch clearly shows an improvement and,
even if not optimal, would improve the status quo.
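
To make the trade-off in the quoted numbers concrete, here is a rough
sketch of the arithmetic. It assumes the scaled latency is simply the
M1 value multiplied by LMUL, which is my reading of the "×2, ×4, etc."
description and not necessarily what the hook does. The names
`measured`, `scaled` and `abs_errors` are made up for illustration.

```python
# Measured load latencies per LMUL, as quoted from the previous MD model.
measured = {1: 3, 2: 4, 4: 8, 8: 16}

def scaled(base_m1):
    """Latency per LMUL if the M1 value is multiplied by LMUL.

    Assumption: linear scaling by LMUL, per the quoted description.
    """
    return {lmul: base_m1 * lmul for lmul in measured}

def abs_errors(base_m1):
    """Absolute deviation of the scaled model from the measured values."""
    model = scaled(base_m1)
    return {lmul: abs(model[lmul] - measured[lmul]) for lmul in measured}

for base in (3, 2):
    errs = abs_errors(base)
    print(f"M1={base}: model={scaled(base)} "
          f"errors={errs} total={sum(errs.values())}")
```

With M1=3 the model is exact at M1 but off by 2/4/8 cycles at
M2/M4/M8; with M1=2 it is off by one cycle at M1 and exact elsewhere.
Note that this unweighted sum ignores how often each LMUL occurs, which
is exactly what hurts when nearly everything is M1, as in SciMark2.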

It's unlikely we have time for the full solution, so maybe we should
settle for a partial one for now? Other opinions?

-- 
Regards
 Robin
