[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #8 from Martin Jambor  ---
Do I understand correctly that this is fixed?

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2018-11-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #7 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Mon Nov 12 23:43:56 2018
New Revision: 266039

URL: https://gcc.gnu.org/viewcvs?rev=266039&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-11-13  Kugan Vivekanandarajah  

PR middle-end/86677
PR middle-end/87528
* tree-scalar-evolution.c (expression_expensive_p): Treat BUILT_IN_POPCOUNT
as expensive when the backend does not define a popcount pattern.

gcc/testsuite/ChangeLog:

2018-11-13  Kugan Vivekanandarajah  

PR middle-end/86677
PR middle-end/87528
* g++.dg/tree-ssa/pr86544.C: Run only for targets supporting the popcount
pattern.
* gcc.dg/tree-ssa/popcount.c: Likewise.
* gcc.dg/tree-ssa/popcount2.c: Likewise.
* gcc.dg/tree-ssa/popcount3.c: Likewise.
* gcc.target/aarch64/popcount4.c: New test.
* lib/target-supports.exp (check_effective_target_popcountl): New.


Added:
trunk/gcc/testsuite/gcc.target/aarch64/popcount4.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c
trunk/gcc/testsuite/lib/target-supports.exp
trunk/gcc/tree-scalar-evolution.c

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2018-10-08 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #6 from rguenther at suse dot de  ---
On Mon, 8 Oct 2018, jamborm at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528
> 
> --- Comment #5 from Martin Jambor  ---
> (In reply to Richard Biener from comment #3)
> > Can you point me to the source for which we generate the popcount call(s)? 
> > It might be not final value replacement but instead code-generating a niter
> > analysis result.
> 
> It turns out it is as simple as:
> 
> 
> typedef unsigned long long BITBOARD;
> 
> int PopCount (BITBOARD b) {
> int c = 0;
> 
> while (b) {
> b &= b - 1;
> c++;
> }
> 
> return c;
> }
> 
> with -O3.  Trunk generates a call to __popcountdi2, whereas GCC 7 (or
> trunk with -march=skylake) does not.

OK, I see.  They have that PopCount and also ThickPopCount, which
apparently is supposed to handle the high-density case better.
They have also open-coded ffs() and friends.

I guess without any idea of 'b' we need to assume that 'b' has
almost no bits set and thus the loop will be fast(er than a call).

This was also noted elsewhere, and a GCC discussion circled around
the fact that we introduce popcount() in other contexts as well
(the above is final value replacement).  So that raises the
question of whether we should simply open-code popcount
on such targets?

I wonder if you can benchmark using ThickPopCount everywhere
(for a non-loopy replacement).

Note currently final value replacement uses expression_expensive_p
which eventually just checks is_inexpensive_builtin which
suffers from the very same issue (in theory).  Patching
is_inexpensive_builtin to check whether we have an optab might
do the trick (with other side-effects, like on inlining, of course).
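
A rough pseudocode sketch of that optab-based check (optab_handler,
popcount_optab and CODE_FOR_nothing are real GCC internals, but this is an
illustration of the idea, not the actual patch):

```
/* Treat a popcount call as expensive unless the target provides an
   instruction pattern for it; otherwise it becomes a libgcc call.  */
expression_expensive_p (expr):
  if expr is a CALL_EXPR to __builtin_popcount{,l,ll}:
    mode = TYPE_MODE (type of the argument)
    if optab_handler (popcount_optab, mode) == CODE_FOR_nothing:
      return true    /* no native pattern, so a libgcc call: expensive */
  /* ... fall through to the existing is_inexpensive_builtin logic ... */
```

This is essentially the shape of the fix that later landed as r266039
(comment #7), which modified expression_expensive_p in
tree-scalar-evolution.c.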

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2018-10-08 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #5 from Martin Jambor  ---
(In reply to Richard Biener from comment #3)
> Can you point me to the source for which we generate the popcount call(s)? 
> It might be not final value replacement but instead code-generating a niter
> analysis result.

It turns out it is as simple as:


typedef unsigned long long BITBOARD;

int PopCount (BITBOARD b) {
int c = 0;

while (b) {
b &= b - 1;
c++;
}

return c;
}

with -O3.  Trunk generates a call to __popcountdi2, whereas GCC 7 (or
trunk with -march=skylake) does not.
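
For context, what final value replacement does to the loop above can be
sketched in plain C: the loop-computed count is replaced by a single popcount
of b, which on a target lacking a popcount instruction pattern expands to the
libgcc __popcountdi2 call seen in the perf profiles in comment #1.  A sketch
for illustration (PopCountReplaced is a hypothetical name, and
__builtin_popcountll is a GCC/Clang extension):

```c
typedef unsigned long long BITBOARD;

/* Original loop from the report.  Kernighan's trick: b &= b - 1 clears the
   lowest set bit, so the loop runs once per set bit.  */
int PopCountLoop (BITBOARD b) {
    int c = 0;
    while (b) {
        b &= b - 1;
        c++;
    }
    return c;
}

/* What final value replacement conceptually produces: a single popcount.
   Without a native popcount pattern, the builtin becomes a call to
   libgcc's __popcountdi2.  */
int PopCountReplaced (BITBOARD b) {
    return __builtin_popcountll (b);
}
```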

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2018-10-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #4 from Alexander Monakov  ---
(In reply to Richard Biener from comment #3)
> Can you point me to the source for which we generate the popcount call(s)? 
> It might be not final value replacement but instead code-generating a niter
> analysis result.

Sorry, the reference to final value replacement was just in order to draw a
parallel. I was pointing out a famous historical precedent where GCC had to
avoid replacing a cheap loop with a costly libgcc call.

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2018-10-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

Richard Biener  changed:

           What    |Removed                     |Added
   ----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener  ---
Can you point me to the source for which we generate the popcount call(s)?  It
might be not final value replacement but instead code-generating a niter
analysis result.

Generally I agree that libgcc popcount should be avoided, but it's not always
easy.  Eventually libgcc popcount could do an ifarch()-style runtime check to
use the native popcount instruction.  Also, that we only have popcountDI is
of course less than optimal, given we loop over all bytes...
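
The runtime-check idea can be sketched with ordinary function-pointer
dispatch (a hypothetical illustration: libgcc would more plausibly use an
ifunc resolver; __builtin_cpu_supports and the target attribute are GCC
extensions, and the sketch falls back to the bit-clearing loop when not
built for x86-64):

```c
#include <stdint.h>

typedef int (*popcount_fn) (uint64_t);

/* Portable fallback: one iteration per set bit.  */
static int popcount_loop (uint64_t b)
{
  int c = 0;
  while (b) {
    b &= b - 1;
    c++;
  }
  return c;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Hardware path: compiled with the popcnt ISA extension enabled, so the
   builtin expands to a single POPCNT instruction.  */
__attribute__((target("popcnt")))
static int popcount_hw (uint64_t b)
{
  return (int) __builtin_popcountll (b);
}
#endif

/* Resolver: pick the hardware version only when the CPU supports it,
   otherwise fall back to the loop.  */
static popcount_fn resolve_popcount (void)
{
#if defined(__GNUC__) && defined(__x86_64__)
  if (__builtin_cpu_supports ("popcnt"))
    return popcount_hw;
#endif
  return popcount_loop;
}
```

The dispatch cost is paid once at resolution time rather than on every
call, which is the point of the ifunc-style approach.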

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2018-10-05 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

Alexander Monakov  changed:

           What    |Removed                     |Added
   ----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
x86 has native popcount only with -msse4.2, otherwise popcount(int) first
zero-extends to 64-bit, then calls __popcountdi2 (64-bit libgcc popcount).

If the original code computes popcount on narrow types, or on values with
only a few non-zero bits, the libgcc replacement can be expected to be slower.

Even if size-wise popcount detection is an optimization, speed-wise GCC
should probably avoid replacing a simple loop with a libgcc call (just like
final value replacement avoids replacing a loop with computations involving
modulus/division).
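
To make the sparse-input point concrete: the b &= b - 1 loop executes
exactly one iteration per set bit, while libgcc's generic __popcountdi2
does a fixed amount of work per call.  An instrumented version of the loop
(a hypothetical helper for illustration only):

```c
#include <stdint.h>

/* Returns the popcount of b and stores the number of loop iterations in
   *iters.  The iteration count always equals the number of set bits.  */
static int popcount_counting (uint64_t b, int *iters)
{
  int c = 0;
  *iters = 0;
  while (b) {
    b &= b - 1;   /* clear the lowest set bit */
    c++;
    (*iters)++;
  }
  return c;
}
```

A bitboard with two pieces takes two iterations; the generic libgcc routine,
by contrast, walks the whole 64-bit operand on every call, so sparse inputs
are exactly where the loop wins.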

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2018-10-05 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

Martin Jambor  changed:

           What    |Removed                     |Added
   ----------------------------------------------------------------------------
                 CC|                            |kugan at gcc dot gnu.org

--- Comment #1 from Martin Jambor  ---
It seems that the machine does not like the newly generated calls into
libgcc for popcount.

The profile of r262486 (_slow variant) and the one immediately
preceding it (the _fast variant) is:

$ perf report -n --percent-limit=2 | cat

# Overhead    Samples  Command           Shared Object  Symbol
# ........  .........  ................  .............  ...............
#
     6.15%     187930  deepsjeng_r_slow  deepsjeng_r    feval
     5.88%     179434  deepsjeng_r_fast  deepsjeng_r    feval
     5.56%     169734  deepsjeng_r_fast  deepsjeng_r    search
     5.42%     165581  deepsjeng_r_slow  deepsjeng_r    search
     5.19%     158575  deepsjeng_r_slow  deepsjeng_r    ProbeTT
     5.16%     157546  deepsjeng_r_fast  deepsjeng_r    ProbeTT
     4.74%     144696  deepsjeng_r_slow  deepsjeng_r    qsearch
     4.72%     144193  deepsjeng_r_fast  deepsjeng_r    qsearch
     2.76%      84389  deepsjeng_r_slow  libgcc_s.so    __popcountdi2
     2.75%      83936  deepsjeng_r_fast  deepsjeng_r    see
     2.73%      83307  deepsjeng_r_slow  deepsjeng_r    see
     2.67%      81614  deepsjeng_r_slow  deepsjeng_r    order_moves
     2.62%      80077  deepsjeng_r_fast  deepsjeng_r    order_moves
     2.49%      76087  deepsjeng_r_slow  deepsjeng_r    FindFirstRemove
     2.47%      75346  deepsjeng_r_fast  deepsjeng_r    FindFirstRemove
     2.03%      61888  deepsjeng_r_fast  deepsjeng_r    make
     2.03%      61861  deepsjeng_r_slow  deepsjeng_r    make


The profile for r262864 (marked again as _slow below) and its
immediate predecessor (marked _fast) is:


# Overhead    Samples  Command           Shared Object  Symbol
# ........  .........  ................  .............  ...............
#
     5.87%     192681  deepsjeng_r_slow  deepsjeng_r    feval
     5.74%     188254  deepsjeng_r_fast  deepsjeng_r    feval
     5.48%     179850  deepsjeng_r_slow  libgcc_s.so    __popcountdi2
     5.17%     169671  deepsjeng_r_slow  deepsjeng_r    search
     5.04%     165438  deepsjeng_r_fast  deepsjeng_r    search
     4.83%     158368  deepsjeng_r_fast  deepsjeng_r    ProbeTT
     4.82%     158096  deepsjeng_r_slow  deepsjeng_r    ProbeTT
     4.44%     145659  deepsjeng_r_fast  deepsjeng_r    qsearch
     4.39%     144117  deepsjeng_r_slow  deepsjeng_r    qsearch
     2.56%      84085  deepsjeng_r_fast  libgcc_s.so    __popcountdi2
     2.55%      83853  deepsjeng_r_slow  deepsjeng_r    see
     2.55%      83653  deepsjeng_r_fast  deepsjeng_r    see
     2.54%      83383  deepsjeng_r_fast  deepsjeng_r    order_moves
     2.44%      80246  deepsjeng_r_slow  deepsjeng_r    order_moves
     2.31%      75966  deepsjeng_r_fast  deepsjeng_r    FindFirstRemove
     2.30%      75575  deepsjeng_r_slow  deepsjeng_r    FindFirstRemove

Again, let me emphasize that this is all about generic march/mtune; with
native march/mtune, the result is almost 3% faster than GCC 8.