[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-31 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #5 from Richard Biener  ---
Hmm, we're invoking memset from libc, which might take a different code path
on Cascade Lake (CXL) than on Zen2?

Note that a vectorized epilogue should in no way cause additional
store-to-load forwarding penalties _but_ it might cause additional
(positive) store-to-load forwardings.

Code-generation-wise the loop leaves a lot to be desired, and given that
we know the number of iterations is 5, the vectorized epilogue will
never be entered, so its overhead can only hurt.  Maybe CXL
branch prediction behaves better here.

Note there's room for improvement in the way we dispatch to the vectorized
epilogue.  Exiting the main vectorized loop we do

  if (do_we_need_an_epilogue)
    {

then for the vectorized epilogue we do

   if (remaining-niters == 1)
     do scalar epilogue
   else
     do vector epilogue

where the complication is due to the fact that we share the scalar
epilogue loops with the loop used when the runtime cost model check
fails.

Thus the CFG with a vectorized epilogue could be structured more
optimally, reducing the overhead to a single jump-around.
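A hypothetical C model of the dispatch described above (the vectorization factors, the function name, and the helper structure are invented for illustration, not GCC's actual code): with niters = 5 and a main-loop VF of 4, only one iteration remains after the main loop, so the remaining-niters check always takes the scalar path and the vectorized epilogue is dead weight.

```c
#include <stdio.h>

/* Illustrative model: a main loop with vectorization factor 4 and an
 * epilogue vectorized with factor 2. */
enum { VF_MAIN = 4, VF_EPI = 2 };

static long run(long niters)
{
    long done = 0;
    while (niters - done >= VF_MAIN)     /* main vectorized loop */
        done += VF_MAIN;
    if (done < niters)                   /* do_we_need_an_epilogue */
    {
        if (niters - done < VF_EPI)      /* too few remaining iterations */
            while (done < niters)        /* scalar epilogue */
                done += 1;
        else                             /* vectorized epilogue */
        {
            while (niters - done >= VF_EPI)
                done += VF_EPI;
            while (done < niters)        /* its own scalar tail */
                done += 1;
        }
    }
    return done;
}

int main(void)
{
    /* niters = 5: the main loop handles 4, 1 remains, the scalar path
     * is taken and the vectorized-epilogue branch is never entered;
     * niters = 7: 3 remain, so the vectorized epilogue does run. */
    printf("%ld %ld\n", run(5), run(7));
    return 0;
}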

For bwaves the other improvement opportunity is to move the memset out
of the full loop nest rather than just out of the innermost two loops.
That would probably also improve register allocation.

[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #4 from Martin Jambor  ---
For the record, on AMD Zen2 at least, SPEC 2006 410.bwaves also runs
about 12% faster with --param vect-epilogues-nomask=0 (and otherwise
with -Ofast -march=native -mtune=native).

[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #3 from Martin Jambor  ---
One more data point: a binary compiled for cascadelake does not run on
Zen2, but one compiled for znver2 runs on Cascade Lake, and there it
makes no difference in run-time.

If disabling epilogues helps on Intel at all, the difference is less than 2%.

[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #2 from Martin Jambor  ---
And for completeness, LNT sees this too and has just managed to catch the
regression:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=276.427.0=295.427.0;

[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #1 from Martin Jambor  ---
For the record, the collected profiles, for both the traditional
"cycles:u" event and the (originally unintended) "ls_stlf:u" event, are
below:

-Ofast -march=native -mtune=native

# Samples: 894K of event 'cycles:u'
# Event count (approx.): 735979402525
#
# Overhead   Samples  Command          Shared Object                 Symbol
# ........  ........  ...............  ............................  ..........................
#
    67.18%    599542  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
    11.40%    102686  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
    11.37%    101388  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
     6.95%     62694  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.88%     16957  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.01%      9023  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned


# Samples: 769K of event 'ls_stlf:u'
# Event count (approx.): 154704730574
#
# Overhead   Samples  Command          Shared Object                 Symbol
# ........  ........  ...............  ............................  ..........................
#
    94.59%    612921  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
     1.83%     88259  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
     1.12%     13615  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.11%     43093  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.05%      8746  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned



-Ofast -march=native -mtune=native --param vect-epilogues-nomask=0

# Samples: 816K of event 'cycles:u'
# Event count (approx.): 671104061807
#
# Overhead   Samples  Command          Shared Object                 Symbol
# ........  ........  ...............  ............................  ..........................
#
    64.07%    521532  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
    12.50%    102670  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
    12.39%    100777  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
     7.60%     62641  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     2.06%     16925  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.17%      9531  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 705K of event 'ls_stlf:u'
# Event count (approx.): 55009340780
#
# Overhead   Samples  Command          Shared Object                 Symbol
# ........  ........  ...............  ............................  ..........................
#
    86.26%    532930  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
     5.15%     88270  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
     3.17%     13696  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     3.06%     57149  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.59%      9226  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned