[Bug rtl-optimization/80197] New: pgo dramatically pessimizes scimark2 MonteCarlo benchmark

vincenzo.innocente at cern dot ch Sun, 26 Mar 2017 06:43:06 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197


            Bug ID: 80197
           Summary: pgo dramatically pessimizes scimark2 MonteCarlo
                    benchmark
           Product: gcc
           Version: 7.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

Created attachment 41053
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41053&action=edit
self contained benchmark of scimark2 MC

While chasing the regression that I found (since identified and solved in #79389),
I discovered that PGO manages to do much worse than that regression.
The symptom is the same: a huge increase in branch misses.
This is not a regression: the behavior has been the same since at least gcc 5.3.
Attached is a self-contained single file, a copy of scimark2 MC, and a couple of
scripts to compile and run it.

To reproduce, just run:
tar -xzf fullMC.tgz
cd fullMC
# standard compilation -O2 -O3
./runit 
# same with pgo passes
./dopgo

or just do
[innocent@vinavx3 fullMC]$ rm -rf pgo/* ; c++ -O3 fullMC.c -g -fprofile-generate=pgo ; time ./a.out
1.848u 0.000s 0:01.85 99.4%     0+0k 0+8io 0pf+0w
[innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g -fprofile-use=./pgo ; time ./a.out
0.967u 0.001s 0:00.96 100.0%    0+0k 0+0io 0pf+0w
[innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g; time ./a.out
0.328u 0.000s 0:00.32 100.0%    0+0k 0+0io 0pf+0w
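
For scale, the three user times above can be turned into slowdown factors relative to the plain -O3 build. This is only a minimal sketch, assuming a POSIX shell with awk available; the helper name `slowdown` is made up for illustration and is not part of the attached scripts.

```shell
# slowdown T BASE: print T/BASE with two decimals.
# Hypothetical helper, not part of the attached runit/dopgo scripts.
slowdown() {
  awk -v t="$1" -v base="$2" 'BEGIN { printf "%.2f\n", t / base }'
}

slowdown 0.967 0.328   # -fprofile-use build vs. plain -O3
slowdown 1.848 0.328   # -fprofile-generate (instrumented) build vs. plain -O3
```

With the times quoted above, the profile-use binary comes out at roughly 2.95 times the plain -O3 run time, and the instrumented binary at roughly 5.63 times.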


for reference:
cat dopgo
cat /proc/cpuinfo | grep name | head -n 1
gcc -v
rm -rf pgo/*;gcc -O2 fullMC.c -g -fprofile-generate=pgo; ./a.out
gcc -O2 fullMC.c -g -fprofile-use=pgo; ./a.out
perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out
rm -rf pgo/*;gcc -O3 fullMC.c -g -fprofile-generate=pgo; ./a.out
gcc -O3 fullMC.c -g -fprofile-use=pgo; ./a.out
perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out


on my machine the result is
# standard compilation
[innocent@vinavx3 fullMC]$ ./runit 
model name      : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure
--prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran
--enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) 
gcc -O2 fullMC.c -g

real    0m0.489s
user    0m0.485s
sys     0m0.002s

 Performance counter stats for './a.out':

        486.303424      task-clock (msec)         #    0.999 CPUs utilized      
        1901271534      cycles                    #    3.910 GHz                
        6403589598      instructions              #    3.37  insn per cycle     
         700683389      branches                  # 1440.836 M/sec              
             13582      branch-misses             #    0.00% of all branches    

       0.486571089 seconds time elapsed

gcc -O3 fullMC.c -g

real    0m0.330s
user    0m0.330s
sys     0m0.000s

 Performance counter stats for './a.out':

        327.385696      task-clock (msec)         #    0.999 CPUs utilized      
        1279958668      cycles                    #    3.910 GHz                
        5009002909      instructions              #    3.91  insn per cycle     
         306481761      branches                  #  936.149 M/sec              
             10805      branch-misses             #    0.00% of all branches    

       0.327637485 seconds time elapsed


// pgo generation and use (perf after use...)
[innocent@vinavx3 fullMC]$ ./dopgo 
model name      : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure
--prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran
--enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) 

 Performance counter stats for './a.out':

        964.399833      task-clock (msec)         #    1.000 CPUs utilized      
        3770455888      cycles                    #    3.910 GHz                
        5007987488      instructions              #    1.33  insn per cycle     
         816525627      branches                  #  846.667 M/sec              
          88982233      branch-misses             #   10.90% of all branches    

       0.964699603 seconds time elapsed


 Performance counter stats for './a.out':

        964.540691      task-clock (msec)         #    1.000 CPUs utilized      
        3771010753      cycles                    #    3.910 GHz                
        5007957589      instructions              #    1.33  insn per cycle     
         816522043      branches                  #  846.540 M/sec              
          88992086      branch-misses             #   10.90% of all branches    

       0.964758684 seconds time elapsed
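
The branch-miss percentage that perf prints can be recomputed from the raw counters, which helps when comparing several runs side by side. The following is a rough sketch, assuming the perf stat layout shown above (counter value in the first field, event name in the second, no thousands separators); the helper name `miss_rate` is hypothetical.

```shell
# miss_rate: read `perf stat` text on stdin and print branch-misses as a
# percentage of branches, with two decimals.
# Hypothetical helper; assumes value in field 1 and event name in field 2.
miss_rate() {
  awk '$2 == "branches"      { b = $1 }
       $2 == "branch-misses" { m = $1 }
       END { if (b > 0) printf "%.2f\n", 100 * m / b }'
}

# Counters copied from the first PGO run above.
miss_rate <<'EOF'
         816525627      branches                  #  846.667 M/sec
          88982233      branch-misses             #   10.90% of all branches
EOF
```

Since perf stat normally writes its summary to stderr, piping a live run into the helper would look something like `perf stat -e branches -e branch-misses ./a.out 2>&1 | miss_rate`, on the assumption that the benchmark itself prints nothing that matches those event names.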
