https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
Bug ID: 94406
Summary: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: andre.simoesdiasvieira at arm dot com
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux

SPEC 2017 FPrate benchmark 503.bwaves_r compiled with -Ofast
-march=native -mtune=native runs 11% slower on AMD Zen2 CPUs when
built with trunk (revision abe13e1847f) than when compiled with
GCC 9.2.

Bisecting led to commit:

  commit 1297712fb4af6c6bfd827e0f0a9695b14669f87d
  Author: Andre Vieira <andre.simoesdiasvie...@arm.com>
  Date:   Thu Oct 31 09:49:47 2019 +0000

    [vect]Make vect-epilogues-nomask=1 default

    This patch turns epilogue vectorization on by default for all targets.

    From-SVN: r277659

If we use current trunk but also build with the option
--param vect-epilogues-nomask=0, we get run-time on par with GCC 9.
This is also the reason why generic march/tuning or building with
-mprefer-vector-width=128 currently results in faster code than
simple -march=native.

Interestingly, I do not see this issue on an Intel Cascade Lake Server
CPU, even though the epilogue is created there too, judging by the CFG
of the hottest function, which looks the same.

And I am not sure to what extent it tells anything at all, but I
accidentally also perf'ed load-to-store-stall events, and in the slow
version the reported "samples" count was 10% higher and the reported
"event count" shot up 2.8 times(!).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
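
For readers unfamiliar with the transformation that r277659 enabled by default, the following is a minimal hand-written C sketch of the loop shape that epilogue vectorization produces. It is an illustration only, not code from 503.bwaves_r (which is Fortran) and not actual GCC output. The function names, the axpy-style kernel, and the 4/2-element widths (standing in for 256-bit and 128-bit double vectors) are all assumptions made for the example.

```c
#include <stddef.h>

/* Reference: the plain scalar loop as it might appear in the source.
   Hypothetical kernel, chosen only for illustration. */
static void axpy_scalar(double *y, const double *x, double a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Hand-written model of the loop structure after epilogue vectorization.
   The inner "lane" loops stand in for single vector instructions. */
static void axpy_epilogue_model(double *y, const double *x, double a, size_t n)
{
    size_t i = 0;

    /* Main vector loop: 4 doubles per iteration (models 256-bit vectors). */
    for (; i + 4 <= n; i += 4)
        for (size_t l = 0; l < 4; ++l)
            y[i + l] += a * x[i + l];

    /* Vectorized epilogue: 2 doubles per iteration (models 128-bit vectors).
       With at most 3 elements left, this runs at most once. */
    for (; i + 2 <= n; i += 2)
        for (size_t l = 0; l < 2; ++l)
            y[i + l] += a * x[i + l];

    /* Scalar tail: at most one remaining element. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

With --param vect-epilogues-nomask=0 the middle loop is not generated, and all leftover iterations go through the scalar tail instead. Both functions compute the same result for any n; only the code path for the last few elements differs, which is the part that this report suspects of behaving differently on Zen2.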