On Wed, 10 Dec 2025 at 15:24, Tomáš Glozar <[email protected]> wrote:
>
> That is a good point. I will look at whether selective scheduling has
> any significant performance benefits on ia64 at the current state at
> least. My system is built with -O2, but some tests I have been doing
> with -O3.
>
I did some testing with GCC 15.2.0 [1] on ia64. There is a small but
measurable improvement of 0.136% (7.196491 tok/s vs 7.186699 tok/s)
when running llama2.c inference, whose core is floating-point
matrix-vector multiplication, built with -O3 compared to
-O3 -fno-selective-scheduling. That is not a representative example
though, as IIUC, selective scheduling should be mostly relevant for
non-numerical computation [2].
For a more relevant benchmark, I built p7zip_16.02 with -O3, once with
selective scheduling, once without it:
tglozar@epic-t2 /tmp/p7zip_16.02 $ ../7za_ss b
...
             Compressing           |         Decompressing
Dict   Speed  Usage   R/U  Rating  |   Speed  Usage   R/U  Rating
Avr:            540   815    4395  |            788   695    5472
Tot:            664   755    4934
tglozar@epic-t2 /tmp/p7zip_16.02 $ ../7za_no_ss b
...
             Compressing           |         Decompressing
Dict   Speed  Usage   R/U  Rating  |   Speed  Usage   R/U  Rating
Avr:            598   876    5234  |            789   700    5517
Tot:            693   788    5375
It appears that the effect of selective scheduling on the average
rating is -16.02% for compressing and -0.81% for decompressing,
relative to the baseline without selective scheduling; that is, a
negative effect. I'm not sure whether the same numbers would be seen
for other workloads; they will likely vary. IIRC, the original
performance numbers are similar, showing decreased performance for
some workloads and increased performance for others.
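For anyone who wants to recheck the percentages, here is a minimal
sketch of the arithmetic. It assumes the last number on each half of
the "Avr:" line is the Rating column; the function name is only
illustrative. Rounding to two decimals can differ from the figures
quoted above in the last digit.

```python
def relative_change_percent(with_ss: float, without_ss: float) -> float:
    """Percent change of the selective-scheduling run relative to the
    baseline built without selective scheduling."""
    return (with_ss - without_ss) / without_ss * 100.0


# Assumption: these are the "Avr:" Rating values from the two 7za runs.
compress = relative_change_percent(4395, 5234)
decompress = relative_change_percent(5472, 5517)

# llama2.c throughput in tok/s: -O3 vs -O3 -fno-selective-scheduling.
llama = relative_change_percent(7.196491, 7.186699)

print(f"p7zip compressing:   {compress:+.2f}%")
print(f"p7zip decompressing: {decompress:+.2f}%")
print(f"llama2.c inference:  {llama:+.3f}%")
```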
[1] Built with a patch disabling late combine by default on IA-64 as
that was shown to produce incorrect code in some cases.
[2] https://dl.acm.org/doi/abs/10.1145/267959.269966
Tomas