[Bug tree-optimization/90128] New: 507.cactuBSSN_r is 9-11% slower at -Ofast and native march/tuning on Zen CPUs

jamborm at gcc dot gnu.org Wed, 17 Apr 2019 04:25:06 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90128


            Bug ID: 90128
           Summary: 507.cactuBSSN_r is 9-11% slower at -Ofast and native
                    march/tuning on Zen CPUs
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

In my own measurements, 507.cactuBSSN_r is about 9.4% slower on an AMD
Zen CPU when compiled with GCC 9 with -Ofast and native march/mtune
than when it is compiled with GCC 8.  LNT currently shows an even larger
11.4% regression: https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch

I have done some bisecting and the slowdown happened in three steps.
First, the benchmark slowed down by about 2% at some point before r262510,
which I have not tracked down yet.  Second, it slowed down by a further 3%
with r263874, but this again seems to be a code-placement issue because
the assembly of the functions which gained perf samples did not
change in that revision, while the stalled-cycles-frontend reported by
perf went from 4.58% to 5.02%.

However, the third regression was caused by the immediately following
revision r263875.  The difference is 4.5% (7.5% when compared to the GCC 8
run-time), even though the stalled-cycles-frontend reported by perf
were only 4.05%.
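
The step sizes can be cross-checked against the cycles:u counts in the
perf stat output quoted in this report.  The following is only a minimal
sketch: the helper and variable names are made up for illustration, and
the single runs shown here need not match the rounded figures in the
text exactly, since those may come from other measurements.

```python
# cycles:u values copied from the perf stat output in this report.
CYCLES = {
    "r263872": 758195547230,
    "r263874": 774757677736,
    "r263875": 809505457529,
}


def slowdown_pct(before: int, after: int) -> float:
    """Return the increase from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Per-step and cumulative slowdowns (names are illustrative only).
step_r263874 = slowdown_pct(CYCLES["r263872"], CYCLES["r263874"])
step_r263875 = slowdown_pct(CYCLES["r263874"], CYCLES["r263875"])
cumulative = slowdown_pct(CYCLES["r263872"], CYCLES["r263875"])

print(f"r263872 -> r263874: {step_r263874:.2f}%")
print(f"r263874 -> r263875: {step_r263875:.2f}%")
print(f"r263872 -> r263875: {cumulative:.2f}%")
```

Using cycles rather than elapsed time keeps the comparison independent
of small clock-frequency differences between runs.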


r263872 (good) perf stat and report:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     238848.989836      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             92923      page-faults:u             #    0.389 K/sec
      758195547230      cycles:u                  #    3.174 GHz                      (83.33%)
       34727040659      stalled-cycles-frontend:u #    4.58% frontend cycles idle     (83.33%)
       15457735869      stalled-cycles-backend:u  #    2.04% backend cycles idle      (83.33%)
     1225370192228      instructions:u            #    1.62  insn per cycle
                                                  #    0.03  stalled cycles per insn  (83.33%)
       23031544594      branches:u                #   96.427 M/sec                    (83.34%)
          18985096      branch-misses:u           #    0.08% of all branches          (83.33%)

     239.158442295 seconds time elapsed

 # Event count (approx.): 758374775503
 #
 # Overhead    Samples  Command       Shared Object      Symbol
 # ........  .........  ............  .................  .........................................
 #
     40.51%     387505  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     22.34%     214782  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      8.42%      80594  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      7.40%      70897  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.77%      55393  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.99%      47952  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.98%      28573  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.44%      23623  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination


r263874 (worse) perf stat and report:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     244036.523777      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             93013      page-faults:u             #    0.381 K/sec
      774757677736      cycles:u                  #    3.175 GHz                      (83.33%)
       38930288027      stalled-cycles-frontend:u #    5.02% frontend cycles idle     (83.33%)
       15508961324      stalled-cycles-backend:u  #    2.00% backend cycles idle      (83.34%)
     1226167776333      instructions:u            #    1.58  insn per cycle
                                                  #    0.03  stalled cycles per insn  (83.33%)
       23218262947      branches:u                #   95.143 M/sec                    (83.33%)
          18890390      branch-misses:u           #    0.08% of all branches          (83.33%)

     244.344340731 seconds time elapsed


 # Samples: 979K of event 'cycles'
 # Event count (approx.): 775138268715
 #
 # Overhead    Samples  Command       Shared Object      Symbol
 # ........  .........  ............  .................  .........................................
 #
     41.43%     404835  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     22.04%     216520  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      8.22%      80341  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      7.26%      71052  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.86%      57419  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.89%      48084  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.92%      28579  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.38%      23520  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination


r263875 (bad) perf stat and report (note that branch misses grew by 6%):

  Performance counter stats for 'numactl -C 0 -l specinvoke':

     254984.828108      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             92949      page-faults:u             #    0.365 K/sec
      809505457529      cycles:u                  #    3.175 GHz                      (83.33%)
       32784020923      stalled-cycles-frontend:u #    4.05% frontend cycles idle     (83.33%)
       15658463714      stalled-cycles-backend:u  #    1.93% backend cycles idle      (83.33%)
     1225361873924      instructions:u            #    1.51  insn per cycle
                                                  #    0.03  stalled cycles per insn  (83.33%)
       23461309363      branches:u                #   92.011 M/sec                    (83.34%)
          20152382      branch-misses:u           #    0.09% of all branches          (83.33%)

     255.313012246 seconds time elapsed

 # Event count (approx.): 812138555051
 #
 # Overhead    Samples  Command       Shared Object      Symbol
 # ........  .........  ............  .................  .........................................
 #
     37.54%     384512  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     27.51%     282987  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      7.80%      79887  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      6.86%      70384  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.73%      58878  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.66%      47990  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.79%      28638  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.28%      23615  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination
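
To see which symbols gained samples between r263874 and r263875, the
Samples columns of the two perf reports above can be compared directly.
This is only a rough sketch: it assumes the sample counts of the two
runs are comparable (same event, similar sampling rate), and the helper
names are hypothetical.

```python
# (r263874 samples, r263875 samples) per symbol, copied from the perf reports.
SAMPLES = {
    "ML_BSSN_RHS_Body": (404835, 384512),
    "ML_BSSN_Advect_Body": (216520, 282987),
    "ML_BSSN_constraints_Body": (80341, 79887),
    "__ieee754_exp_avx": (71052, 70384),
    "ML_BSSN_convertToADMBaseDtLapseShift_Body": (57419, 58878),
    "ML_BSSN_convertToADMBase_Body": (48084, 47990),
    "ML_BSSN_InitRHS_Body": (28579, 28638),
    "MoL_LinearCombination": (23520, 23615),
}


def sample_deltas(samples):
    """Map each symbol to (absolute change, relative change in percent)."""
    return {
        sym: (after - before, (after - before) / before * 100.0)
        for sym, (before, after) in samples.items()
    }


deltas = sample_deltas(SAMPLES)

# Symbol with the largest absolute gain in samples.
worst = max(deltas, key=lambda sym: deltas[sym][0])

# Print symbols ordered from largest gain to largest loss.
for sym, (abs_change, rel_change) in sorted(
    deltas.items(), key=lambda kv: kv[1][0], reverse=True
):
    print(f"{sym:45s} {abs_change:+8d} {rel_change:+7.2f}%")

print("largest gain:", worst)
```

With the numbers quoted here, ML_BSSN_Advect_Body is the only symbol
whose sample count changes substantially, which suggests it is the
place to look first in the r263875 code.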

I did the bisecting on a machine with glibc 2.26 but the issue was
detected on one with glibc 2.29.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
