https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90128
Bug ID: 90128
Summary: 507.cactuBSSN_r is 9-11% slower at -Ofast and native march/tuning on Zen CPUs
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux

In my own measurements, 507.cactuBSSN_r is about 9.4% slower on an AMD Zen CPU when compiled with GCC 9 with -Ofast and native march/mtune than when it is compiled with GCC 8. LNT currently even shows an 11.4% regression:

https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch

I have done some bisecting and the slowdown happened in three steps.

First, the benchmark slowed down by about 2% at some point before r262510, which I have not tracked down yet.

Second, it slowed down by a further 3% with r263874, but this seems to be a code-placement issue again, because the assembly of the functions which gained perf samples did not change in that revision and the stalled-cycles-frontend reported by perf went from 4.58% to 5.02%.

However, the third regression was caused by the immediately following revision, r263875. The difference is 4.5% (7.5% when compared to the GCC 8 run time), while the stalled-cycles-frontend reported by perf was only 4.05%.
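As a rough cross-check of the step sizes quoted above, the counter values from the perf stat output in this report can be compared directly. The sketch below is only illustrative: the names `runs`, `pct_change` and `ipc` are made up for this example and are not part of perf or of the benchmark harness, and the numbers are copied by hand from the three runs listed in this report.

```python
# Counter values copied by hand from the perf stat/report output in this
# report. The dictionary layout and helper names are hypothetical.
runs = {
    "r263872": {"cycles": 758195547230, "instructions": 1225370192228,
                "branch_misses": 18985096, "advect_samples": 214782},
    "r263874": {"cycles": 774757677736, "instructions": 1226167776333,
                "branch_misses": 18890390, "advect_samples": 216520},
    "r263875": {"cycles": 809505457529, "instructions": 1225361873924,
                "branch_misses": 20152382, "advect_samples": 282987},
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


def ipc(run):
    """Instructions per cycle for one run."""
    return run["instructions"] / run["cycles"]


revisions = list(runs)  # insertion order: oldest revision first
for prev, cur in zip(revisions, revisions[1:]):
    a, b = runs[prev], runs[cur]
    print(f"{prev} -> {cur}:")
    print(f"  cycles          {pct_change(a['cycles'], b['cycles']):+.2f}%")
    print(f"  IPC             {ipc(a):.2f} -> {ipc(b):.2f}")
    print(f"  branch misses   {pct_change(a['branch_misses'], b['branch_misses']):+.2f}%")
    print(f"  Advect samples  {pct_change(a['advect_samples'], b['advect_samples']):+.2f}%")
```

The instruction counts are nearly identical across the three runs, so the extra cycles show up almost entirely as lower IPC, and between r263874 and r263875 the additional samples land mostly in ML_BSSN_Advect_Body.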
r263872 (good) perf stat and report:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     238848.989836      task-clock:u (msec)        #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
             92923      page-faults:u              #    0.389 K/sec
      758195547230      cycles:u                   #    3.174 GHz                      (83.33%)
       34727040659      stalled-cycles-frontend:u  #    4.58% frontend cycles idle     (83.33%)
       15457735869      stalled-cycles-backend:u   #    2.04% backend cycles idle      (83.33%)
     1225370192228      instructions:u             #    1.62  insn per cycle
                                                   #    0.03  stalled cycles per insn  (83.33%)
       23031544594      branches:u                 #   96.427 M/sec                    (83.34%)
          18985096      branch-misses:u            #    0.08% of all branches          (83.33%)

     239.158442295 seconds time elapsed

# Event count (approx.): 758374775503
#
# Overhead  Samples  Command       Shared Object      Symbol
# ........  .......  ............  .................  .........................................
#
    40.51%   387505  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
    22.34%   214782  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
     8.42%    80594  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
     7.40%    70897  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
     5.77%    55393  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
     4.99%    47952  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
     2.98%    28573  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
     2.44%    23623  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination


r263874 (worse) perf stat and report:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     244036.523777      task-clock:u (msec)        #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
             93013      page-faults:u              #    0.381 K/sec
      774757677736      cycles:u                   #    3.175 GHz                      (83.33%)
       38930288027      stalled-cycles-frontend:u  #    5.02% frontend cycles idle     (83.33%)
       15508961324      stalled-cycles-backend:u   #    2.00% backend cycles idle      (83.34%)
     1226167776333      instructions:u             #    1.58  insn per cycle
                                                   #    0.03  stalled cycles per insn  (83.33%)
       23218262947      branches:u                 #   95.143 M/sec                    (83.33%)
          18890390      branch-misses:u            #    0.08% of all branches          (83.33%)

     244.344340731 seconds time elapsed

# Samples: 979K of event 'cycles'
# Event count (approx.): 775138268715
#
# Overhead  Samples  Command       Shared Object      Symbol
# ........  .......  ............  .................  .........................................
#
    41.43%   404835  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
    22.04%   216520  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
     8.22%    80341  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
     7.26%    71052  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
     5.86%    57419  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
     4.89%    48084  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
     2.92%    28579  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
     2.38%    23520  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination


r263875 (bad) perf stat and report (note that branch misses grew by 6%):

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     254984.828108      task-clock:u (msec)        #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
             92949      page-faults:u              #    0.365 K/sec
      809505457529      cycles:u                   #    3.175 GHz                      (83.33%)
       32784020923      stalled-cycles-frontend:u  #    4.05% frontend cycles idle     (83.33%)
       15658463714      stalled-cycles-backend:u   #    1.93% backend cycles idle      (83.33%)
     1225361873924      instructions:u             #    1.51  insn per cycle
                                                   #    0.03  stalled cycles per insn  (83.33%)
       23461309363      branches:u                 #   92.011 M/sec                    (83.34%)
          20152382      branch-misses:u            #    0.09% of all branches          (83.33%)

     255.313012246 seconds time elapsed

# Event count (approx.): 812138555051
#
# Overhead  Samples  Command       Shared Object      Symbol
# ........  .......  ............  .................  .........................................
#
    37.54%   384512  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
    27.51%   282987  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
     7.80%    79887  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
     6.86%    70384  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
     5.73%    58878  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
     4.66%    47990  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
     2.79%    28638  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
     2.28%    23615  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination


I did the bisecting on a machine with glibc 2.26 but the issue was detected on one with glibc 2.29.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)