https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90234
            Bug ID: 90234
           Summary: 503.bwaves_r is 6% slower on Zen CPUs at -Ofast with
                    native march/mtune than with generic ones
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: hubicka at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

In my experiments on an EPYC CPU and GCC trunk r270364, 503.bwaves_r is over
6% slower at -Ofast when I supply -march=native -mtune=native than when I
compile for generic x86_64.  LNT sees a 3.55% regression too:
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/tuning
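A standalone kernel of roughly the same shape as the hot loop in
mat_times_vec_ (analyzed below) may help in comparing the code the two option
sets generate in isolation; the following is only an illustrative sketch with
invented names, not the actual SPEC source:

  /* kernel.c - illustrative only; the function name, array names and the
     calling convention are made up.  Compare, for example:
       gcc -Ofast -S kernel.c
       gcc -Ofast -march=native -mtune=native -S kernel.c  */
  void
  mat_times_vec_like (double *restrict y, const double *restrict a,
                      const double *restrict b, const double *restrict c,
                      double ka, double kb, double kc, long n)
  {
    for (long i = 0; i < n; i++)
      /* multiply-accumulate over several input streams with a
         read-modify-write of y, similar to the vectorized loop below */
      y[i] = ka * a[i] + kb * b[i] + kc * c[i] + y[i];
  }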
perf stat and report of the generic (fast) binary run:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      240411.714022      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              35189      page-faults:u              #    0.146 K/sec
       757727387955      cycles:u                   #    3.152 GHz                  (83.32%)
        40175950077      stalled-cycles-frontend:u  #    5.30% frontend cycles idle (83.31%)
        91872393105      stalled-cycles-backend:u   #   12.12% backend cycles idle  (83.37%)
      2177387522561      instructions:u             #    2.87  insn per cycle
                                                    #    0.04  stalled cycles per insn (83.32%)
        98299602685      branches:u                 #  408.880 M/sec                (83.32%)
          131591436      branch-misses:u            #    0.13% of all branches      (83.36%)

      240.668052943 seconds time elapsed

# Samples: 960K of event 'cycles'
# Event count (approx.): 755626377551
#
# Overhead  Samples  Command   Shared Object      Symbol
# ........  .......  ........  .................  .......................
#
    62.10%   595840  bwaves_r  bwaves_r_peak-gen  mat_times_vec_
    13.91%   133958  bwaves_r  bwaves_r_peak-gen  shell_
    12.40%   119012  bwaves_r  bwaves_r_peak-gen  bi_cgstab_block_
     7.81%    75246  bwaves_r  bwaves_r_peak-gen  jacobian_
     2.11%    20290  bwaves_r  bwaves_r_peak-gen  flux_
     1.27%    12217  bwaves_r  libc-2.29.so       __memset_avx2_unaligned

perf stat and report of the native (slow) binary run:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      255695.249393      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              35604      page-faults:u              #    0.139 K/sec
       800619530480      cycles:u                   #    3.131 GHz                  (83.32%)
        77320365388      stalled-cycles-frontend:u  #    9.66% frontend cycles idle (83.34%)
        93389410778      stalled-cycles-backend:u   #   11.66% backend cycles idle  (83.33%)
      1821704428841      instructions:u             #    2.28  insn per cycle
                                                    #    0.05  stalled cycles per insn (83.32%)
        99885762475      branches:u                 #  390.644 M/sec                (83.34%)
          130710907      branch-misses:u            #    0.13% of all branches      (83.34%)

      255.958363704 seconds time elapsed

# Samples: 1M of event 'cycles'
# Event count (approx.): 804011318580
#
# Overhead  Samples  Command   Shared Object      Symbol
# ........  .......  ........  .................  .......................
#
    64.87%   662574  bwaves_r  bwaves_r_peak-nat  mat_times_vec_
    12.75%   130756  bwaves_r  bwaves_r_peak-nat  shell_
    11.48%   117266  bwaves_r  bwaves_r_peak-nat  bi_cgstab_block_
     7.45%    76415  bwaves_r  bwaves_r_peak-nat  jacobian_
     1.92%    19701  bwaves_r  bwaves_r_peak-nat  flux_
     1.34%    13662  bwaves_r  libc-2.29.so       __memset_avx2_unaligned

Examining the slow mat_times_vec_ further, perf claims that the following
loop is the most sample-heavy:

  0.01 |6c0:+->vmulpd (%r8,%rax,1),%xmm9,%xmm0
  4.34 |    |  vandnp (%r10,%rax,1),%xmm2,%xmm1
  0.83 |    |  vfmadd (%r15,%rax,1),%xmm11,%xmm1
  1.35 |    |  vfmadd (%r14,%rax,1),%xmm10,%xmm0
  5.85 |    |  vaddpd %xmm1,%xmm0,%xmm1
  7.41 |    |  vmulpd (%rsi,%rax,1),%xmm7,%xmm0
  2.19 |    |  vfmadd (%rdi,%rax,1),%xmm8,%xmm0
  3.97 |    |  vmovap %xmm0,%xmm12
  0.07 |    |  vmulpd (%r11,%rax,1),%xmm5,%xmm0
  0.93 |    |  vfmadd (%rcx,%rax,1),%xmm6,%xmm0
  1.92 |    |  vaddpd %xmm12,%xmm0,%xmm0
  6.34 |    |  vaddpd %xmm1,%xmm0,%xmm0
  9.58 |    |  vmovup %xmm0,(%r10,%rax,1)
  0.49 |    |  add    $0x10,%rax
  0.05 |    |  cmp    %rax,0x38(%rsp)
  0.02 |    +--jne    6c0

Objdump perhaps gives a better idea about exactly which instructions these
are:

  4011c0:  c4 c1 31 59 04 00  vmulpd (%r8,%rax,1),%xmm9,%xmm0
  4011c6:  c4 c1 68 55 0c 02  vandnps (%r10,%rax,1),%xmm2,%xmm1
  4011cc:  c4 c2 a1 b8 0c 07  vfmadd231pd (%r15,%rax,1),%xmm11,%xmm1
  4011d2:  c4 c2 a9 b8 04 06  vfmadd231pd (%r14,%rax,1),%xmm10,%xmm0
  4011d8:  c5 f9 58 c9        vaddpd %xmm1,%xmm0,%xmm1
  4011dc:  c5 c1 59 04 06     vmulpd (%rsi,%rax,1),%xmm7,%xmm0
  4011e1:  c4 e2 b9 b8 04 07  vfmadd231pd (%rdi,%rax,1),%xmm8,%xmm0
  4011e7:  c5 78 28 e0        vmovaps %xmm0,%xmm12
  4011eb:  c4 c1 51 59 04 03  vmulpd (%r11,%rax,1),%xmm5,%xmm0
  4011f1:  c4 e2 c9 b8 04 01  vfmadd231pd (%rcx,%rax,1),%xmm6,%xmm0
  4011f7:  c4 c1 79 58 c4     vaddpd %xmm12,%xmm0,%xmm0
  4011fc:  c5 f9 58 c1        vaddpd %xmm1,%xmm0,%xmm0
  401200:  c4 c1 78 11 04 02  vmovups %xmm0,(%r10,%rax,1)
  401206:  48 83 c0 10        add    $0x10,%rax
  40120a:  48 39 44 24 38     cmp    %rax,0x38(%rsp)
  40120f:  75 af              jne    4011c0 <mat_times_vec_+0x6c0>

I did a quick experiment with completely disabling FMA generation, but it did
not help.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
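As a reading aid (not part of the original analysis), the hot loop shown
above corresponds roughly to the following C-with-intrinsics transcription;
the array and coefficient names are invented, only the operation sequence is
taken from the objdump:

  #include <immintrin.h>

  /* Hypothetical names: k5-k11 and mask stand for the loop-invariant
     xmm5-xmm11 and xmm2 registers, the pointers for the base registers,
     and y for the array that is both loaded (via vandnps) and stored.
     Two doubles are processed per iteration, matching the 128-bit loop.
     Needs -mfma (or -march=native) to build.  */
  static void
  hot_loop (double *y, const double *a, const double *b, const double *c,
            const double *d, const double *e, const double *f,
            const double *g, __m128d k5, __m128d k6, __m128d k7,
            __m128d k8, __m128d k9, __m128d k10, __m128d k11,
            __m128d mask, long n)
  {
    for (long i = 0; i < n; i += 2)
      {
        __m128d t0, t1, t2;
        t0 = _mm_mul_pd (k9, _mm_loadu_pd (&a[i]));         /* vmulpd      */
        t1 = _mm_andnot_pd (mask, _mm_loadu_pd (&y[i]));    /* vandnps     */
        t1 = _mm_fmadd_pd (k11, _mm_loadu_pd (&b[i]), t1);  /* vfmadd231pd */
        t0 = _mm_fmadd_pd (k10, _mm_loadu_pd (&c[i]), t0);  /* vfmadd231pd */
        t1 = _mm_add_pd (t0, t1);                           /* vaddpd      */
        t0 = _mm_mul_pd (k7, _mm_loadu_pd (&d[i]));         /* vmulpd      */
        t0 = _mm_fmadd_pd (k8, _mm_loadu_pd (&e[i]), t0);   /* vfmadd231pd */
        t2 = t0;                                            /* vmovaps     */
        t0 = _mm_mul_pd (k5, _mm_loadu_pd (&f[i]));         /* vmulpd      */
        t0 = _mm_fmadd_pd (k6, _mm_loadu_pd (&g[i]), t0);   /* vfmadd231pd */
        t0 = _mm_add_pd (t0, t2);                           /* vaddpd      */
        t0 = _mm_add_pd (t1, t0);                           /* vaddpd      */
        _mm_storeu_pd (&y[i], t0);                          /* vmovups     */
      }
  }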