https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197
            Bug ID: 80197
           Summary: pgo dramatically pessimizes scimark2 MonteCarlo benchmark
           Product: gcc
           Version: 7.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

Created attachment 41053
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41053&action=edit
self contained benchmark of scimark2 MC

While chasing the regression that I then found, identified and solved in #79389,
I discovered that pgo manages to do much worse than that regression.
The symptom is the same: a huge increase in branch-misses.
This is not a regression: the behaviour has been the same at least since gcc 5.3.

Attached is a self-contained single file, a copy of scimark2 MC, and a couple of
scripts to compile and run it. Just

tar -xzf fullMC.tgz
cd fullMC
# standard compilation -O2 -O3
./runit
# same with pgo passes
./dopgo

or just do

[innocent@vinavx3 fullMC]$ rm -rf pgo/* ; c++ -O3 fullMC.c -g -fprofile-generate=pgo ; time ./a.out
1.848u 0.000s 0:01.85 99.4% 0+0k 0+8io 0pf+0w
[innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g -fprofile-use=./pgo ; time ./a.out
0.967u 0.001s 0:00.96 100.0% 0+0k 0+0io 0pf+0w
[innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g; time ./a.out
0.328u 0.000s 0:00.32 100.0% 0+0k 0+0io 0pf+0w

for reference:

cat dopgo
cat /proc/cpuinfo | grep name | head -n 1
gcc -v
rm -rf pgo/*;gcc -O2 fullMC.c -g -fprofile-generate=pgo; ./a.out
gcc -O2 fullMC.c -g -fprofile-use=pgo; ./a.out
perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out
rm -rf pgo/*;gcc -O3 fullMC.c -g -fprofile-generate=pgo; ./a.out
gcc -O3 fullMC.c -g -fprofile-use=pgo; ./a.out
perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out

On my machine the result is

# standard compilation
[innocent@vinavx3 fullMC]$ ./runit
model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC)
gcc -O2 fullMC.c -g

real 0m0.489s
user 0m0.485s
sys 0m0.002s

 Performance counter stats for './a.out':

        486.303424      task-clock (msec)         #    0.999 CPUs utilized
        1901271534      cycles                    #    3.910 GHz
        6403589598      instructions              #    3.37  insn per cycle
         700683389      branches                  # 1440.836 M/sec
             13582      branch-misses             #    0.00% of all branches

       0.486571089 seconds time elapsed

gcc -O3 fullMC.c -g

real 0m0.330s
user 0m0.330s
sys 0m0.000s

 Performance counter stats for './a.out':

        327.385696      task-clock (msec)         #    0.999 CPUs utilized
        1279958668      cycles                    #    3.910 GHz
        5009002909      instructions              #    3.91  insn per cycle
         306481761      branches                  #  936.149 M/sec
             10805      branch-misses             #    0.00% of all branches

       0.327637485 seconds time elapsed

// pgo generation and use (perf after use...)
[innocent@vinavx3 fullMC]$ ./dopgo
model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC)

 Performance counter stats for './a.out':

        964.399833      task-clock (msec)         #    1.000 CPUs utilized
        3770455888      cycles                    #    3.910 GHz
        5007987488      instructions              #    1.33  insn per cycle
         816525627      branches                  #  846.667 M/sec
          88982233      branch-misses             #   10.90% of all branches

       0.964699603 seconds time elapsed

 Performance counter stats for './a.out':

        964.540691      task-clock (msec)         #    1.000 CPUs utilized
        3771010753      cycles                    #    3.910 GHz
        5007957589      instructions              #    1.33  insn per cycle
         816522043      branches                  #  846.540 M/sec
          88992086      branch-misses             #   10.90% of all branches

       0.964758684 seconds time elapsed