https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90234
            Bug ID: 90234
           Summary: 503.bwaves_r is 6% slower on Zen CPUs at -Ofast with
                    native march/mtune than with generic ones
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: hubicka at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

In my experiments on an EPYC CPU and GCC trunk r270364, 503.bwaves_r is over
6% slower at -Ofast when I supply -march=native -mtune=native than when I
compile for generic x86_64.  LNT sees a 3.55% regression too:
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/tuning
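A standalone kernel of roughly the same shape as the hot loop in
mat_times_vec_ (analyzed below) may help in comparing the code the two option
sets generate in isolation; the following is only an illustrative sketch with
invented names, not the actual SPEC source:

  /* kernel.c - illustrative only; the function name, array names and the
     calling convention are made up.  Compare, for example:
       gcc -Ofast -S kernel.c
       gcc -Ofast -march=native -mtune=native -S kernel.c  */
  void
  mat_times_vec_like (double *restrict y, const double *restrict a,
                      const double *restrict b, const double *restrict c,
                      double ka, double kb, double kc, long n)
  {
    for (long i = 0; i < n; i++)
      /* multiply-accumulate over several input streams with a
         read-modify-write of y, similar to the vectorized loop below */
      y[i] = ka * a[i] + kb * b[i] + kc * c[i] + y[i];
  }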
perf stat and report of the generic (fast) binary run:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      240411.714022      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              35189      page-faults:u              #    0.146 K/sec
       757727387955      cycles:u                   #    3.152 GHz                  (83.32%)
        40175950077      stalled-cycles-frontend:u  #    5.30% frontend cycles idle (83.31%)
        91872393105      stalled-cycles-backend:u   #   12.12% backend cycles idle  (83.37%)
      2177387522561      instructions:u             #    2.87  insn per cycle
                                                    #    0.04  stalled cycles per insn (83.32%)
        98299602685      branches:u                 #  408.880 M/sec                (83.32%)
          131591436      branch-misses:u            #    0.13% of all branches      (83.36%)

      240.668052943 seconds time elapsed

# Samples: 960K of event 'cycles'
# Event count (approx.): 755626377551
#
# Overhead  Samples  Command   Shared Object      Symbol
# ........  .......  ........  .................  .......................
#
    62.10%   595840  bwaves_r  bwaves_r_peak-gen  mat_times_vec_
    13.91%   133958  bwaves_r  bwaves_r_peak-gen  shell_
    12.40%   119012  bwaves_r  bwaves_r_peak-gen  bi_cgstab_block_
     7.81%    75246  bwaves_r  bwaves_r_peak-gen  jacobian_
     2.11%    20290  bwaves_r  bwaves_r_peak-gen  flux_
     1.27%    12217  bwaves_r  libc-2.29.so       __memset_avx2_unaligned

perf stat and report of the native (slow) binary run:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      255695.249393      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              35604      page-faults:u              #    0.139 K/sec
       800619530480      cycles:u                   #    3.131 GHz                  (83.32%)
        77320365388      stalled-cycles-frontend:u  #    9.66% frontend cycles idle (83.34%)
        93389410778      stalled-cycles-backend:u   #   11.66% backend cycles idle  (83.33%)
      1821704428841      instructions:u             #    2.28  insn per cycle
                                                    #    0.05  stalled cycles per insn (83.32%)
        99885762475      branches:u                 #  390.644 M/sec                (83.34%)
          130710907      branch-misses:u            #    0.13% of all branches      (83.34%)

      255.958363704 seconds time elapsed

# Samples: 1M of event 'cycles'
# Event count (approx.): 804011318580
#
# Overhead  Samples  Command   Shared Object      Symbol
# ........  .......  ........  .................  .......................
#
    64.87%   662574  bwaves_r  bwaves_r_peak-nat  mat_times_vec_
    12.75%   130756  bwaves_r  bwaves_r_peak-nat  shell_
    11.48%   117266  bwaves_r  bwaves_r_peak-nat  bi_cgstab_block_
     7.45%    76415  bwaves_r  bwaves_r_peak-nat  jacobian_
     1.92%    19701  bwaves_r  bwaves_r_peak-nat  flux_
     1.34%    13662  bwaves_r  libc-2.29.so       __memset_avx2_unaligned

Examining the slow mat_times_vec_ further, perf claims that the following
loop is the most sample-heavy:

  0.01 |6c0:+->vmulpd (%r8,%rax,1),%xmm9,%xmm0
  4.34 |    |  vandnp (%r10,%rax,1),%xmm2,%xmm1
  0.83 |    |  vfmadd (%r15,%rax,1),%xmm11,%xmm1
  1.35 |    |  vfmadd (%r14,%rax,1),%xmm10,%xmm0
  5.85 |    |  vaddpd %xmm1,%xmm0,%xmm1
  7.41 |    |  vmulpd (%rsi,%rax,1),%xmm7,%xmm0
  2.19 |    |  vfmadd (%rdi,%rax,1),%xmm8,%xmm0
  3.97 |    |  vmovap %xmm0,%xmm12
  0.07 |    |  vmulpd (%r11,%rax,1),%xmm5,%xmm0
  0.93 |    |  vfmadd (%rcx,%rax,1),%xmm6,%xmm0
  1.92 |    |  vaddpd %xmm12,%xmm0,%xmm0
  6.34 |    |  vaddpd %xmm1,%xmm0,%xmm0
  9.58 |    |  vmovup %xmm0,(%r10,%rax,1)
  0.49 |    |  add    $0x10,%rax
  0.05 |    |  cmp    %rax,0x38(%rsp)
  0.02 |    +--jne    6c0

Objdump perhaps gives a better idea about exactly which instructions these
are:

  4011c0:  c4 c1 31 59 04 00  vmulpd (%r8,%rax,1),%xmm9,%xmm0
  4011c6:  c4 c1 68 55 0c 02  vandnps (%r10,%rax,1),%xmm2,%xmm1
  4011cc:  c4 c2 a1 b8 0c 07  vfmadd231pd (%r15,%rax,1),%xmm11,%xmm1
  4011d2:  c4 c2 a9 b8 04 06  vfmadd231pd (%r14,%rax,1),%xmm10,%xmm0
  4011d8:  c5 f9 58 c9        vaddpd %xmm1,%xmm0,%xmm1
  4011dc:  c5 c1 59 04 06     vmulpd (%rsi,%rax,1),%xmm7,%xmm0
  4011e1:  c4 e2 b9 b8 04 07  vfmadd231pd (%rdi,%rax,1),%xmm8,%xmm0
  4011e7:  c5 78 28 e0        vmovaps %xmm0,%xmm12
  4011eb:  c4 c1 51 59 04 03  vmulpd (%r11,%rax,1),%xmm5,%xmm0
  4011f1:  c4 e2 c9 b8 04 01  vfmadd231pd (%rcx,%rax,1),%xmm6,%xmm0
  4011f7:  c4 c1 79 58 c4     vaddpd %xmm12,%xmm0,%xmm0
  4011fc:  c5 f9 58 c1        vaddpd %xmm1,%xmm0,%xmm0
  401200:  c4 c1 78 11 04 02  vmovups %xmm0,(%r10,%rax,1)
  401206:  48 83 c0 10        add    $0x10,%rax
  40120a:  48 39 44 24 38     cmp    %rax,0x38(%rsp)
  40120f:  75 af              jne    4011c0 <mat_times_vec_+0x6c0>

I did a quick experiment with completely disabling FMA generation, but it did
not help.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
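As a reading aid (not part of the original analysis), the hot loop shown
above corresponds roughly to the following C-with-intrinsics transcription;
the array and coefficient names are invented, only the operation sequence is
taken from the objdump:

  #include <immintrin.h>

  /* Hypothetical names: k5-k11 and mask stand for the loop-invariant
     xmm5-xmm11 and xmm2 registers, the pointers for the base registers,
     and y for the array that is both loaded (via vandnps) and stored.
     Two doubles are processed per iteration, matching the 128-bit loop.
     Needs -mfma (or -march=native) to build.  */
  static void
  hot_loop (double *y, const double *a, const double *b, const double *c,
            const double *d, const double *e, const double *f,
            const double *g, __m128d k5, __m128d k6, __m128d k7,
            __m128d k8, __m128d k9, __m128d k10, __m128d k11,
            __m128d mask, long n)
  {
    for (long i = 0; i < n; i += 2)
      {
        __m128d t0, t1, t2;
        t0 = _mm_mul_pd (k9, _mm_loadu_pd (&a[i]));         /* vmulpd      */
        t1 = _mm_andnot_pd (mask, _mm_loadu_pd (&y[i]));    /* vandnps     */
        t1 = _mm_fmadd_pd (k11, _mm_loadu_pd (&b[i]), t1);  /* vfmadd231pd */
        t0 = _mm_fmadd_pd (k10, _mm_loadu_pd (&c[i]), t0);  /* vfmadd231pd */
        t1 = _mm_add_pd (t0, t1);                           /* vaddpd      */
        t0 = _mm_mul_pd (k7, _mm_loadu_pd (&d[i]));         /* vmulpd      */
        t0 = _mm_fmadd_pd (k8, _mm_loadu_pd (&e[i]), t0);   /* vfmadd231pd */
        t2 = t0;                                            /* vmovaps     */
        t0 = _mm_mul_pd (k5, _mm_loadu_pd (&f[i]));         /* vmulpd      */
        t0 = _mm_fmadd_pd (k6, _mm_loadu_pd (&g[i]), t0);   /* vfmadd231pd */
        t0 = _mm_add_pd (t0, t2);                           /* vaddpd      */
        t0 = _mm_add_pd (t1, t0);                           /* vaddpd      */
        _mm_storeu_pd (&y[i], t0);                          /* vmovups     */
      }
  }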