[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #5 from Richard Biener ---

Hmm, we're invoking memset from libc, which might use a different path on CXL compared to Zen2? Note that a vectorized epilogue should in no way cause additional store-to-load forwarding penalties, _but_ it might cause additional (positive) store-to-load forwardings.

Code-generation wise the loop leaves a lot to be desired, and given we know the number of iterations is 5, the vectorized epilogue will never be entered, so its overhead will only hurt. Maybe CXL branch prediction behaves better here.

Note there's room for improvement in the way we dispatch to the vectorized epilogue. Exiting the main vectorized loop we do

  if (do_we_need_an_epilogue)
    {

then for the vectorized epilogue we do

  if (remaining-niters == 1)
    do scalar epilogue
  else
    do vector epilogue

where the complication is due to the fact that we share the scalar epilogue loops with the loop used when the runtime cost model check fails. Thus the CFG with a vectorized epilogue could be structured more optimally, reducing the overhead to a single jump-around.

For bwaves the other improvement opportunity is to move the memset out of the full loop nest rather than just covering the innermost two loops. That probably improves register allocation.
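For illustration, a minimal C sketch of the dispatch structure the comment describes: a main loop vectorized with one factor, a vectorized epilogue with a smaller factor, and a scalar epilogue, selected by checks after the main loop exits. The function name, the vectorization factors, and the use of plain scalar adds in place of vector operations are all hypothetical; this is not GCC's actual generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vectorization factors; GCC's actual choices depend on
   the target and the loop. */
#define MAIN_VF 8
#define EPI_VF  4

long sum_with_epilogues(const int *a, size_t n)
{
    long sum = 0;
    size_t i = 0;

    /* Main vectorized loop: the inner j-loop stands in for one
       vector iteration processing MAIN_VF elements at once. */
    for (; i + MAIN_VF <= n; i += MAIN_VF)
        for (size_t j = 0; j < MAIN_VF; j++)
            sum += a[i + j];

    /* First dispatch check: do we need an epilogue at all? */
    if (i < n) {
        /* Second check: too few remaining iterations for the
           vectorized epilogue -> take the scalar epilogue
           (this stands in for the remaining-niters test). */
        if (n - i < EPI_VF) {
            for (; i < n; i++)          /* scalar epilogue */
                sum += a[i];
        } else {
            for (; i + EPI_VF <= n; i += EPI_VF)
                for (size_t j = 0; j < EPI_VF; j++)
                    sum += a[i + j];    /* vectorized epilogue */
            for (; i < n; i++)          /* scalar tail of epilogue */
                sum += a[i];
        }
    }
    return sum;
}
```

Even in this toy form, two conditional branches sit between the main loop exit and the final result, which is the overhead the comment suggests could be reduced to a single jump-around with a better-structured CFG.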
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #4 from Martin Jambor ---

For the record, on AMD Zen2 at least, SPEC 2006 410.bwaves also runs about 12% faster with --param vect-epilogues-nomask=0 (and otherwise with -Ofast -march=native -mtune=native).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #3 from Martin Jambor ---

One more data point: a binary compiled for cascadelake does not run on Zen2, but one built for znver2 runs on Cascade Lake, where it makes no difference in run-time. If disabling epilogues helps on Intel at all, the difference is less than 2%.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #2 from Martin Jambor ---

And for completeness, LNT sees this too and has just managed to catch the regression: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=276.427.0=295.427.0;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #1 from Martin Jambor ---

For the record, the collected profiles, both for the traditional "cycles:u" event and the (originally unintended) "ls_stlf:u" event, are below:

-Ofast -march=native -mtune=native

# Samples: 894K of event 'cycles:u'
# Event count (approx.): 735979402525
#
# Overhead  Samples  Command          Shared Object                 Symbol
#
  67.18%    599542  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
  11.40%    102686  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
  11.37%    101388  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
   6.95%     62694  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.88%     16957  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.01%      9023  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 769K of event 'ls_stlf:u'
# Event count (approx.): 154704730574
#
# Overhead  Samples  Command          Shared Object                 Symbol
#
  94.59%    612921  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
   1.83%     88259  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
   1.12%     13615  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.11%     43093  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.05%      8746  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

-Ofast -march=native -mtune=native --param vect-epilogues-nomask=0

# Samples: 816K of event 'cycles:u'
# Event count (approx.): 671104061807
#
# Overhead  Samples  Command          Shared Object                 Symbol
#
  64.07%    521532  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
  12.50%    102670  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
  12.39%    100777  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
   7.60%     62641  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   2.06%     16925  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.17%      9531  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 705K of event 'ls_stlf:u'
# Event count (approx.): 55009340780
#
# Overhead  Samples  Command          Shared Object                 Symbol
#
  86.26%    532930  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
   5.15%     88270  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
   3.17%     13696  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   3.06%     57149  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.59%      9226  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned