[Bug rtl-optimization/89853] Regression of 525.x264_r at -O2 (and generic tuning) on AMD EPYC

2019-03-28 Thread bergner at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89853

--- Comment #7 from Peter Bergner  ---
(In reply to Martin Jambor from comment #6)
> Hi, the assembly of the most affected function does not change at all, just
> its offset (is 0x10 bytes bigger).  Aligning the loops in the function a bit
> more avoids most of the slowdown but not quite all of it.  In any event,
> this is a microarchitectural problem that we probably cannot do anything
> about.  Sorry for the noise, I will check for this the next time before I
> report a problem.

We've seen similar issues on POWER, where a particular revision slightly changes
the size of one function, which shifts the offset of some later function and
changes its performance.  Unfortunately, just increasing function alignment to
eliminate that has other unintended performance issues.

Thanks for isolating the issue.

2019-03-28 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89853

Martin Jambor  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #6 from Martin Jambor  ---
Hi, the assembly of the most affected function does not change at all, just its
offset (is 0x10 bytes bigger).  Aligning the loops in the function a bit more
avoids most of the slowdown but not quite all of it.  In any event, this is a
microarchitectural problem that we probably cannot do anything about.  Sorry
for the noise, I will check for this the next time before I report a problem.
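
To illustrate why an unchanged instruction sequence can behave differently after a 0x10-byte shift, here is a minimal Python sketch. It only does address arithmetic: it counts how many 64-byte blocks a code range touches before and after the shift. The start address, the loop size, and the 64-byte block size are assumptions chosen for illustration, not values taken from this bug, and nothing here claims that block crossings are the actual cause on EPYC.

```python
def blocks_touched(start, size, block=64):
    """Return how many `block`-byte aligned blocks the range [start, start+size) touches."""
    first = start // block
    last = (start + size - 1) // block
    return last - first + 1


# Hypothetical values for illustration only (not taken from the bug report).
LOOP_SIZE = 0x40          # assumed size of a hot loop body in bytes
OLD_START = 0x401000      # assumed start address, 64-byte aligned
NEW_START = OLD_START + 0x10  # same code, shifted by 0x10 bytes

for label, start in (("old", OLD_START), ("new", NEW_START)):
    print(label, hex(start),
          "offset within block:", start % 64,
          "blocks touched:", blocks_touched(start, LOOP_SIZE))
```

With these assumed numbers the same 64 bytes of code fit in one block at the old address and straddle two at the new one, which is the kind of layout change that alignment options such as -falign-loops or -falign-functions are meant to influence.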

2019-03-28 Thread bergner at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89853

--- Comment #5 from Peter Bergner  ---
(In reply to Martin Liška from comment #4)
> Just for the record, my Ryzen machine periodic tester probably improved due
> to the revision:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=158.377.0&plot.1=41.377.0&plot.2=70.377.0&plot.3=31.377.0
> 
> As seen, it's now about 5% faster than GCC8 branch.

Very interesting, thanks for that!  Since the two of you both used -O2 and
generic tuning (i.e., same code), that would tend to agree with my speculation
that this is an AMD EPYC specific pipeline issue/hazard/... we're unluckily
hitting.  Agreed?  If so, I'm not sure we can really blame my patch, but if
someone could narrow down what the exact issue is that is causing the slowdown,
maybe we can mitigate it somehow.

2019-03-28 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89853

Martin Liška  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marxin at gcc dot gnu.org

--- Comment #4 from Martin Liška  ---
Just for the record, my Ryzen machine periodic tester probably improved due to
the revision:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=158.377.0&plot.1=41.377.0&plot.2=70.377.0&plot.3=31.377.0

As seen, it's now about 5% faster than GCC8 branch.

2019-03-27 Thread bergner at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89853

--- Comment #3 from Peter Bergner  ---
I don't have access to that type of machine and honestly don't know the ISA
well enough to know the differences between what runs well and what doesn't
just by looking at the code.  Can you point out some code/function where the
assembler code is worse?

The patch you bisected to only removes unneeded conflicts in the conflict
graph, which gives the allocators more freedom, which in general is a good
thing.  That said, since these are all heuristics built on top of heuristics,
it's not impossible that giving more freedom could lead to worse code.
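
As a rough illustration of the point about conflicts and allocator freedom, the following Python sketch colors a tiny conflict graph greedily and shows that dropping one unneeded conflict edge lets two values share a register. This is a toy model with made-up pseudo-register names; it is not GCC's allocator and does not reproduce the patch discussed here.

```python
def greedy_assign(nodes, conflicts):
    """Give each node the lowest register number not used by a conflicting node."""
    neighbours = {n: set() for n in nodes}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)
    assignment = {}
    for n in nodes:
        used = {assignment[m] for m in neighbours[n] if m in assignment}
        reg = 0
        while reg in used:
            reg += 1
        assignment[n] = reg
    return assignment


# Hypothetical pseudo-registers; the ("a", "c") edge plays the role of an
# unneeded conflict that a more precise analysis could remove.
nodes = ["a", "b", "c"]
with_extra_conflict = [("a", "b"), ("b", "c"), ("a", "c")]
without_extra_conflict = [("a", "b"), ("b", "c")]

for label, edges in (("with extra conflict", with_extra_conflict),
                     ("without extra conflict", without_extra_conflict)):
    regs = greedy_assign(nodes, edges)
    print(label, regs, "registers used:", max(regs.values()) + 1)
```

Fewer conflicts means more legal assignments, but which one the heuristics pick can still change code layout or size, which is how extra freedom could occasionally lead to slower code.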

My guess, though, is that we're probably tickling an AMD-specific hardware
pipeline feature, since you said you don't see the same thing on Intel.

2019-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89853

--- Comment #2 from Martin Jambor  ---
Doh, yes, copy-paste error, sorry.  The data should have been:

FAST:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     495413.105450      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             80572      page-faults:u             #    0.163 K/sec
     1573525941814      cycles:u                  #    3.176 GHz                      (83.33%)
       56730573392      stalled-cycles-frontend:u #    3.61% frontend cycles idle     (83.33%)
      397644125819      stalled-cycles-backend:u  #   25.27% backend cycles idle      (83.33%)
     5157395976259      instructions:u            #    3.28  insn per cycle
                                                  #    0.08  stalled cycles per insn  (83.33%)
      421019689027      branches:u                #  849.836 M/sec                    (83.33%)
       10705813341      branch-misses:u           #    2.54% of all branches          (83.33%)

 495.869208013 seconds time elapsed

# Event count (approx.): 1576108148398
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ...........................
#
    14.20%   282290  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_satd_8x4
    11.19%   222403  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] get_ref
    10.82%   215061  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_16x16
     7.00%   139082  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_16x16
     6.11%   121470  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_16x16
     5.89%   116939  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_8x8
     5.09%   101266  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] quant_4x4
     4.10%    81471  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] mc_chroma
     2.47%    49122  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_8x8
     2.21%    43928  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] sub4x4_dct
     2.14%    42598  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] pixel_hadamard_ac



SLOW:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     526858.531112      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             81064      page-faults:u             #    0.154 K/sec
     1673634535742      cycles:u                  #    3.177 GHz                      (83.33%)
       64458929239      stalled-cycles-frontend:u #    3.85% frontend cycles idle     (83.33%)
      397586117982      stalled-cycles-backend:u  #   23.76% backend cycles idle      (83.33%)
     5157346862311      instructions:u            #    3.08  insn per cycle
                                                  #    0.08  stalled cycles per insn  (83.33%)
      421082988475      branches:u                #  799.234 M/sec                    (83.33%)
       14226205709      branch-misses:u           #    3.38% of all branches          (83.33%)

 527.353829377 seconds time elapsed


# Samples: 2M of event 'cycles'
# Event count (approx.): 1675655436335
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ...........................
#
    14.13%   298519  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_16x16
    13.43%   283793  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_satd_8x4
    11.56%   244196  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] get_ref
     7.17%   151589  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_16x16
     6.29%   132936  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_16x16
     5.28%   111517  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_8x8
     4.84%   102317  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] quant_4x4
     3.86%    81563  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] mc_chroma
     2.57%    54233  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_8x8
     2.08%    43964  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] sub4x4_dct
     2.01%    42520  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] pixel_hadamard_ac
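
The two runs are easier to compare as relative changes. The Python sketch below takes the counter values and per-function sample counts copied from the FAST and SLOW outputs above and prints the percentage change per counter and the functions whose sample counts grew the most. The dictionary layout and helper name are only one possible way to organize the numbers.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Counter values copied from the perf stat output above.
fast = {"cycles": 1573525941814, "instructions": 5157395976259,
        "branch-misses": 10705813341, "seconds": 495.869208013}
slow = {"cycles": 1673634535742, "instructions": 5157346862311,
        "branch-misses": 14226205709, "seconds": 527.353829377}

for name in fast:
    print(f"{name:14s} {pct_change(fast[name], slow[name]):+7.2f}%")

# Per-function sample counts copied from the perf report output above.
fast_samples = {
    "x264_pixel_satd_8x4": 282290, "get_ref": 222403,
    "x264_pixel_sad_x4_16x16": 215061, "x264_pixel_sad_16x16": 139082,
    "x264_pixel_sad_x3_16x16": 121470, "x264_pixel_sad_x4_8x8": 116939,
    "quant_4x4": 101266, "mc_chroma": 81471,
    "x264_pixel_sad_x3_8x8": 49122, "sub4x4_dct": 43928,
    "pixel_hadamard_ac": 42598,
}
slow_samples = {
    "x264_pixel_sad_x4_16x16": 298519, "x264_pixel_satd_8x4": 283793,
    "get_ref": 244196, "x264_pixel_sad_x3_16x16": 151589,
    "x264_pixel_sad_16x16": 132936, "x264_pixel_sad_x4_8x8": 111517,
    "quant_4x4": 102317, "mc_chroma": 81563,
    "x264_pixel_sad_x3_8x8": 54233, "sub4x4_dct": 43964,
    "pixel_hadamard_ac": 42520,
}

# Sort functions by how many more samples they collected in the slow run.
deltas = sorted(((slow_samples[f] - fast_samples[f], f) for f in fast_samples),
                reverse=True)
for delta, func in deltas[:3]:
    print(f"{func:26s} {delta:+d} samples")
```

The instruction counts are nearly identical between the two runs, while cycles and branch misses go up; in the sample data the largest increase is in x264_pixel_sad_x4_16x16.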

2019-03-27 Thread bergner at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89853

--- Comment #1 from Peter Bergner  ---
Cut and paste error?  The two data sets look the same to me...or am I missing
something?