https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81441

Bug ID: 81441
Summary: slowdown due to -fpeel-loops and -ftracer added by -fprofile-use
Product: gcc
Version: 5.4.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: Joost.VandeVondele at mat dot ethz.ch
Target Milestone: ---

For our code, we see a slowdown (3%-7% depending on the user reporting) due to the options -fpeel-loops and -ftracer added by default when using -fprofile-use. The code is stockfish, which is presumably the strongest open source chess engine, and part of benchmark suites such as https://openbenchmarking.org/test/pts/stockfish

The same behaviour has been observed for gcc versions from 4.X to 7.1 so it is not some recent regression and quite persistent. (Discussions in https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/YzV_fG7ejR4 and https://github.com/official-stockfish/Stockfish/pull/1165 )

It is not easy for me to pinpoint the location in the code that is affected most (despite the code being only ~5000 lines of C++). I tried differential profiling with perf, but didn't get profiles that made sense to me.

It is easy to reproduce, by testing two successive git commits where the change of options in the Makefile is the only difference:

git clone https://github.com/official-stockfish/Stockfish.git
cd Stockfish/src/

# version with -fprofile-use -fno-peel-loops -fno-tracer
# ======================================================
git checkout c8e5384c3a4a5d9ac709c9b50954907a7f07109c
make clean && make -j ARCH=x86-64-modern profile-build
./stockfish bench 128 1 16 default depth 2>&1 | grep 'Total time (ms)'
# (locally reports Total time (ms) : 9947)

# version with just -fprofile-use
# ======================================================
git checkout 0371a8f8c4a043cb3e7d08b5b8e7d08d49f28324
make clean && make -j ARCH=x86-64-modern profile-build
./stockfish bench 128 1 16 default depth 2>&1 | grep 'Total time (ms)'
# (locally reports Total time (ms) : 10456)

So '-fprofile-use -fno-peel-loops -fno-tracer' is 5% faster than '-fprofile-use' in my case.

Let me know if I can provide more info. The length of the benchmarks can be adjusted easily by changing the '16' in the bench command to smaller (shorter) or larger (longer) numbers (time increases/decreases exponentially, change in steps of 1 to have ~2x change).
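
For anyone comparing their own numbers, here is a small sketch of the arithmetic behind the figures above. The helper names pct_slower and est_time_ms are made up for illustration and are not part of Stockfish or gcc. The doubling per depth step is only the rough "~2x" rule of thumb mentioned above, so the estimate is approximate.

```shell
#!/bin/sh
# Hypothetical helpers; neither Stockfish nor gcc provides them.

# pct_slower BASE_MS OTHER_MS
# Prints how much longer OTHER_MS is than BASE_MS, in percent.
pct_slower() {
    awk -v base="$1" -v other="$2" \
        'BEGIN { printf "%.2f\n", (other - base) / base * 100 }'
}

# est_time_ms BASE_MS BASE_DEPTH NEW_DEPTH
# Rough estimate of the bench time at NEW_DEPTH, assuming the time
# doubles for every +1 in depth (an approximation, not a measurement).
est_time_ms() {
    awk -v base="$1" -v ref="$2" -v new="$3" \
        'BEGIN { printf "%.0f\n", base * 2 ^ (new - ref) }'
}

# Times reported above: 9947 ms with -fno-peel-loops -fno-tracer,
# 10456 ms with plain -fprofile-use.
pct_slower 9947 10456     # prints 5.12
est_time_ms 9947 16 17    # prints 19894
```

A single pair of runs can be noisy, so it is probably worth repeating each bench a few times before comparing the percentages.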