[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #8 from Martin Jambor ---
Do I understand correctly that this is fixed?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #7 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Mon Nov 12 23:43:56 2018
New Revision: 266039

URL: https://gcc.gnu.org/viewcvs?rev=266039&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-11-13  Kugan Vivekanandarajah

	PR middle-end/86677
	PR middle-end/87528
	* tree-scalar-evolution.c (expression_expensive_p): Make
	BUILTIN POPCOUNT expensive when the backend does not define it.

gcc/testsuite/ChangeLog:

2018-11-13  Kugan Vivekanandarajah

	PR middle-end/86677
	PR middle-end/87528
	* g++.dg/tree-ssa/pr86544.C: Run only for targets supporting the
	popcount pattern.
	* gcc.dg/tree-ssa/popcount.c: Likewise.
	* gcc.dg/tree-ssa/popcount2.c: Likewise.
	* gcc.dg/tree-ssa/popcount3.c: Likewise.
	* gcc.target/aarch64/popcount4.c: New test.
	* lib/target-supports.exp (check_effective_target_popcountl): New.

Added:
    trunk/gcc/testsuite/gcc.target/aarch64/popcount4.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C
    trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c
    trunk/gcc/testsuite/lib/target-supports.exp
    trunk/gcc/tree-scalar-evolution.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #6 from rguenther at suse dot de ---
On Mon, 8 Oct 2018, jamborm at gcc dot gnu.org wrote:

> --- Comment #5 from Martin Jambor ---
> (In reply to Richard Biener from comment #3)
> > Can you point me to the source for which we generate the popcount call(s)?
> > It might be not final value replacement but instead code-generating a niter
> > analysis result.
>
> It turns out it is as simple as:
>
> typedef unsigned long long BITBOARD;
>
> int PopCount (BITBOARD b) {
>   int c = 0;
>
>   while (b) {
>     b &= b - 1;
>     c++;
>   }
>
>   return c;
> }
>
> with -O3.  Trunk generates a call to __popcountdi2 while GCC 7 (or
> trunk with -march=skylake) does not.

OK, I see.  They have that PopCount and also ThickPopCount, which apparently
is supposed to handle the high-density case better.  They have also
open-coded ffs() & friends.

I guess that without any idea of 'b' we need to assume that 'b' has almost no
bits set and thus the loop will be fast(er than a call).  This was also noted
elsewhere, and a GCC discussion circled around the fact that we introduce
popcount() into other contexts as well (the above is final value replacement).

So that raises the question whether we should simply open-code popcount on
the target?  I wonder if you can benchmark using ThickPopCount everywhere
(for a non-loopy replacement).

Note that final value replacement currently uses expression_expensive_p,
which eventually just checks is_inexpensive_builtin, and that suffers from
the very same issue (in theory).  Patching is_inexpensive_builtin to check
whether we have an optab might do the trick (with other side-effects, like
on inlining, of course).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #5 from Martin Jambor ---
(In reply to Richard Biener from comment #3)
> Can you point me to the source for which we generate the popcount call(s)?
> It might be not final value replacement but instead code-generating a niter
> analysis result.

It turns out it is as simple as:

typedef unsigned long long BITBOARD;

int PopCount (BITBOARD b) {
  int c = 0;

  while (b) {
    b &= b - 1;
    c++;
  }

  return c;
}

with -O3.  Trunk generates a call to __popcountdi2 while GCC 7 (or trunk
with -march=skylake) does not.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #4 from Alexander Monakov ---
(In reply to Richard Biener from comment #3)
> Can you point me to the source for which we generate the popcount call(s)?
> It might be not final value replacement but instead code-generating a niter
> analysis result.

Sorry, the reference to final value replacement was only meant to draw a
parallel: it is a well-known historical precedent where GCC had to avoid
replacing a cheap loop with a costly libgcc call.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Keywords |                            |missed-optimization
                CC |                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
Can you point me to the source for which we generate the popcount call(s)?
It might be not final value replacement but instead code-generating a niter
analysis result.

Generally I agree that libgcc popcount should be avoided, but it's not always
easy.  Eventually libgcc popcount could do an ifarch() runtime check to use
the native popcount instruction.  Also, that we only have popcountDI is of
course less than optimal, given we loop over all bytes...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov ---
x86 has native popcount only with -msse4.2; otherwise popcount(int) first
zero-extends to 64 bits, then calls __popcountdi2 (the 64-bit libgcc
popcount).  If the original code computes popcount on narrow types, or on
values with only a few non-zero bits, the libgcc replacement can be expected
to be slower.

Even if popcount detection is an optimization size-wise, speed-wise GCC
probably should avoid replacing a simple loop with a libgcc call (just like
final value replacement avoids replacing a loop with computations involving
modulus/division).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

Martin Jambor changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |kugan at gcc dot gnu.org

--- Comment #1 from Martin Jambor ---
It seems that the machine does not like the newly generated calls into libgcc
for popcount.  The profile of r262486 (the _slow variant) and of the revision
immediately preceding it (the _fast variant) is:

$ perf report -n --percent-limit=2 | cat
# Overhead   Samples  Command           Shared Object  Symbol
# ........  ........  ................  .............  ...............
     6.15%    187930  deepsjeng_r_slow  deepsjeng_r    feval
     5.88%    179434  deepsjeng_r_fast  deepsjeng_r    feval
     5.56%    169734  deepsjeng_r_fast  deepsjeng_r    search
     5.42%    165581  deepsjeng_r_slow  deepsjeng_r    search
     5.19%    158575  deepsjeng_r_slow  deepsjeng_r    ProbeTT
     5.16%    157546  deepsjeng_r_fast  deepsjeng_r    ProbeTT
     4.74%    144696  deepsjeng_r_slow  deepsjeng_r    qsearch
     4.72%    144193  deepsjeng_r_fast  deepsjeng_r    qsearch
     2.76%     84389  deepsjeng_r_slow  libgcc_s.so    __popcountdi2
     2.75%     83936  deepsjeng_r_fast  deepsjeng_r    see
     2.73%     83307  deepsjeng_r_slow  deepsjeng_r    see
     2.67%     81614  deepsjeng_r_slow  deepsjeng_r    order_moves
     2.62%     80077  deepsjeng_r_fast  deepsjeng_r    order_moves
     2.49%     76087  deepsjeng_r_slow  deepsjeng_r    FindFirstRemove
     2.47%     75346  deepsjeng_r_fast  deepsjeng_r    FindFirstRemove
     2.03%     61888  deepsjeng_r_fast  deepsjeng_r    make
     2.03%     61861  deepsjeng_r_slow  deepsjeng_r    make

The profile for r262864 (again marked _slow below) and its immediate
predecessor (marked _fast) is:

# Overhead   Samples  Command           Shared Object  Symbol
# ........  ........  ................  .............  ...............
     5.87%    192681  deepsjeng_r_slow  deepsjeng_r    feval
     5.74%    188254  deepsjeng_r_fast  deepsjeng_r    feval
     5.48%    179850  deepsjeng_r_slow  libgcc_s.so    __popcountdi2
     5.17%    169671  deepsjeng_r_slow  deepsjeng_r    search
     5.04%    165438  deepsjeng_r_fast  deepsjeng_r    search
     4.83%    158368  deepsjeng_r_fast  deepsjeng_r    ProbeTT
     4.82%    158096  deepsjeng_r_slow  deepsjeng_r    ProbeTT
     4.44%    145659  deepsjeng_r_fast  deepsjeng_r    qsearch
     4.39%    144117  deepsjeng_r_slow  deepsjeng_r    qsearch
     2.56%     84085  deepsjeng_r_fast  libgcc_s.so    __popcountdi2
     2.55%     83853  deepsjeng_r_slow  deepsjeng_r    see
     2.55%     83653  deepsjeng_r_fast  deepsjeng_r    see
     2.54%     83383  deepsjeng_r_fast  deepsjeng_r    order_moves
     2.44%     80246  deepsjeng_r_slow  deepsjeng_r    order_moves
     2.31%     75966  deepsjeng_r_fast  deepsjeng_r    FindFirstRemove
     2.30%     75575  deepsjeng_r_slow  deepsjeng_r    FindFirstRemove

Again, let me emphasize that this is all with generic march/mtune; with
native march/mtune the result is almost 3% faster than GCC 8.