[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #21 from Wilco  ---
(In reply to Gabriel Ravier from comment #19)

> If the original code being branchless makes it faster, wouldn't that imply
> that we should use the table-based implementation when generating code for
> `__builtin_ctz` ?

__builtin_ctz is 3-4 times faster than the table implementation, so this
optimization is always worth it. This is why I believe the current situation is
not ideal since various targets still set CTZ_DEFINED_VALUE_AT_ZERO to 0 or 1.
One option would be to always allow it in Gimple (perhaps add an extra argument
for the value to return for a zero input), and at expand time check whether the
backend supports the requested value. It it doesn't, emit branches.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #20 from Jakub Jelinek  ---
No, because __builtin_ctz is branchless too, it just has UB when the argument
is 0.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread gabravier at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #19 from Gabriel Ravier  ---
(In reply to Jakub Jelinek from comment #14)
> The patch does:
> +  bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval)
> == 2;
> +
> +  /* Skip if there is no value defined at zero, or if we can't easily
> +return the correct value for zero.  */
> +  if (!zero_ok)
> +   return false;
> +  if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
> +   return false;
> For CTZ_DEFINED_VALUE_AT_ZERO == 1 we could support it the same way but we'd
> need
> to emit into the IL an equivalent of val == 0 ? zero_val : .CTZ (val) (with
> GIMPLE_COND and a separate bb - not sure if anything in forwprop creates new
> basic blocks right now), where there is a high chance that RTL opts would
> turn it back into unconditional
> ctz.
> That still wouldn't help non--mbmi x86, because CTZ_DEFINED_VALUE_AT_ZERO is
> 0 there.
> We could handle even that case by doing the branches around, but those would
> stay there
> in the generated code, at which point I wonder whether it would be a win. 
> The original
> code is branchless...

If the original code being branchless makes it faster, wouldn't that imply that
we should use the table-based implementation when generating code for
`__builtin_ctz` ?

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #18 from Jakub Jelinek  ---
It is generally a win for cases where the condition can't be predicted, while
if it can, jumps are much better.  We have dozens or hundreds of PRs about this
in either direction on x86.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #17 from Wilco  ---
(In reply to Jakub Jelinek from comment #16)
> (In reply to Wilco from comment #15)
> > It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO
> > == 2 so that you always get the same result even when you don't have tzcnt.
> > A conditional move would be possible, so it adds an extra 2 instructions at
> > worst (ie. still significantly faster than doing the table lookup, multiply
> > etc). And it could be optimized when you know CLZ/CTZ input is non-zero.
> 
> Conditional moves are a lottery on x86, in many cases very bad idea.  And
> when people actually use __builtin_clz*, they state that they don't care
> about the 0 value, so emitting terribly performing code for it just in case
> would be wrong.
> If forwprop emits the conditional in separate blocks for the CTZ_DVAZ!=2
> case, on targets where conditional moves are beneficial for it it can also
> emit them, or emit the jump which say on x86 will be most likely faster than
> cmov.

Well GCC emits a cmov for this (-O2 -march=x86-64-v2):

int ctz(long a)
{
  return (a == 0) ? 64 : __builtin_ctzl (a);
}

ctz:
xor edx, edx
mov eax, 64
rep bsf rdx, rdi
testrdi, rdi
cmovne  eax, edx
ret

Note the extra 'test' seems redundant since IIRC bsf sets Z=1 if the input is
zero.

On Zen 2 this has identical performance as the plain builtin when you loop it
as res = ctz (res) + 1; (ie. measuring latency of non-zero case). So I find it
hard to believe cmov is expensive on modern cores.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #16 from Jakub Jelinek  ---
(In reply to Wilco from comment #15)
> It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO
> == 2 so that you always get the same result even when you don't have tzcnt.
> A conditional move would be possible, so it adds an extra 2 instructions at
> worst (ie. still significantly faster than doing the table lookup, multiply
> etc). And it could be optimized when you know CLZ/CTZ input is non-zero.

Conditional moves are a lottery on x86, in many cases very bad idea.  And when
people actually use __builtin_clz*, they state that they don't care about the 0
value, so emitting terribly performing code for it just in case would be wrong.
If forwprop emits the conditional in separate blocks for the CTZ_DVAZ!=2 case,
on targets where conditional moves are beneficial for it it can also emit them,
or emit the jump which say on x86 will be most likely faster than cmov.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #15 from Wilco  ---
(In reply to Jakub Jelinek from comment #14)
> The patch does:
> +  bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval)
> == 2;
> +
> +  /* Skip if there is no value defined at zero, or if we can't easily
> +return the correct value for zero.  */
> +  if (!zero_ok)
> +   return false;
> +  if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
> +   return false;
> For CTZ_DEFINED_VALUE_AT_ZERO == 1 we could support it the same way but we'd
> need
> to emit into the IL an equivalent of val == 0 ? zero_val : .CTZ (val) (with
> GIMPLE_COND and a separate bb - not sure if anything in forwprop creates new
> basic blocks right now), where there is a high chance that RTL opts would
> turn it back into unconditional
> ctz.
> That still wouldn't help non--mbmi x86, because CTZ_DEFINED_VALUE_AT_ZERO is
> 0 there.
> We could handle even that case by doing the branches around, but those would
> stay there
> in the generated code, at which point I wonder whether it would be a win. 
> The original
> code is branchless...

It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO == 2
so that you always get the same result even when you don't have tzcnt. A
conditional move would be possible, so it adds an extra 2 instructions at worst
(ie. still significantly faster than doing the table lookup, multiply etc). And
it could be optimized when you know CLZ/CTZ input is non-zero.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #14 from Jakub Jelinek  ---
The patch does:
+  bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) ==
2;
+
+  /* Skip if there is no value defined at zero, or if we can't easily
+return the correct value for zero.  */
+  if (!zero_ok)
+   return false;
+  if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
+   return false;
For CTZ_DEFINED_VALUE_AT_ZERO == 1 we could support it the same way but we'd
need
to emit into the IL an equivalent of val == 0 ? zero_val : .CTZ (val) (with
GIMPLE_COND and a separate bb - not sure if anything in forwprop creates new
basic blocks right now), where there is a high chance that RTL opts would turn
it back into unconditional
ctz.
That still wouldn't help non--mbmi x86, because CTZ_DEFINED_VALUE_AT_ZERO is 0
there.
We could handle even that case by doing the branches around, but those would
stay there
in the generated code, at which point I wonder whether it would be a win.  The
original
code is branchless...

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-16 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #13 from Andrew Pinski  ---
(In reply to Gabriel Ravier from comment #12)
> It appears this new optimization is non-functional on trunk with x86-64...
> specifically on x86-64, too, on AArch64 it works just fine. So does that
> mean this bug should be re-opened or should a new bug be opened for that ?

It does work with -mbmi where the instruction which is used has a defined value
at 0.
You can open a new issue for improvement of the case for not supplying -mbmi .
That is CTZ_DEFINED_VALUE_AT_ZERO is non-2.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-16 Thread gabravier at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Gabriel Ravier  changed:

   What|Removed |Added

 CC||gabravier at gmail dot com

--- Comment #12 from Gabriel Ravier  ---
It appears this new optimization is non-functional on trunk with x86-64...
specifically on x86-64, too, on AArch64 it works just fine. So does that mean
this bug should be re-opened or should a new bug be opened for that ?

[Bug tree-optimization/90838] Detect table-based ctz implementation

2020-01-14 Thread tnfchris at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 CC||tnfchris at gcc dot gnu.org
 Resolution|--- |FIXED
   Target Milestone|--- |10.0

--- Comment #11 from Tamar Christina  ---
Fixed on master

[Bug tree-optimization/90838] Detect table-based ctz implementation

2020-01-13 Thread cvs-commit at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #10 from CVS Commits  ---
The master branch has been updated by Jakub Jelinek :

https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=eb7c7c524556df5364f03adc20f6a9db20858484

commit r10-5912-geb7c7c524556df5364f03adc20f6a9db20858484
Author: Jakub Jelinek 
Date:   Mon Jan 13 14:14:57 2020 +0100

tree-opt: Fix bootstrap failure in tree-ssa-forwprop.c some more PR90838

2020-01-13  Jakub Jelinek  

PR tree-optimization/90838
* tree-ssa-forwprop.c (simplify_count_trailing_zeroes): Use
SCALAR_INT_TYPE_MODE directly in CTZ_DEFINED_VALUE_AT_ZERO macro
argument rather than to initialize temporary for targets that
don't use the mode argument at all.  Initialize ctzval to avoid
warning at -O0.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2020-01-10 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #9 from Jakub Jelinek  ---
Author: jakub
Date: Fri Jan 10 21:10:03 2020
New Revision: 280140

URL: https://gcc.gnu.org/viewcvs?rev=280140&root=gcc&view=rev
Log:
PR tree-optimization/90838
* tree-ssa-forwprop.c (simplify_count_trailing_zeroes): Use
SCALAR_INT_TYPE_MODE instead of TYPE_MODE as operand of
CTZ_DEFINED_VALUE_AT_ZERO.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-ssa-forwprop.c

[Bug tree-optimization/90838] Detect table-based ctz implementation

2020-01-10 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #8 from Wilco  ---
Author: wilco
Date: Fri Jan 10 19:32:53 2020
New Revision: 280132

URL: https://gcc.gnu.org/viewcvs?rev=280132&root=gcc&view=rev
Log:
PR90838: Support ctz idioms

Support common idioms for count trailing zeroes using an array lookup.
The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
constant which when multiplied by a power of 2 creates a unique value
in the top 5 or 6 bits.  This is then indexed into a table which maps it
to the number of trailing zeroes.  When the table is valid, we emit a
sequence using the target defined value for ctz (0):

int ctz1 (unsigned x)
{
  static const char table[32] =
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

Is optimized to:

rbitw0, w0
clz w0, w0
and w0, w0, 31
ret

gcc/
PR tree-optimization/90838
* tree-ssa-forwprop.c (check_ctz_array): Add new function.
(check_ctz_string): Likewise.
(optimize_count_trailing_zeroes): Likewise.
(simplify_count_trailing_zeroes): Likewise.
(pass_forwprop::execute): Try ctz simplification.
* match.pd: Add matching for ctz idioms.

testsuite/
PR tree-optimization/90838
* testsuite/gcc.target/aarch64/pr90838.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/aarch64/pr90838.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/match.pd
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-forwprop.c

[Bug tree-optimization/90838] Detect table-based ctz implementation

2019-08-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2019-08-09
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #7 from Wilco  ---
I'll have a look at this, I think it could easily be done in match.pd if we add
support for matching array references.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2019-06-12 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #6 from rguenther at suse dot de  ---
On Wed, 12 Jun 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838
> 
> --- Comment #5 from ktkachov at gcc dot gnu.org ---
> FWIW, there is another similar function in deepsjeng that computes a
> side-effect:
> int
> myctz2 (unsigned long long * const b) {
>  unsigned long long lsb = (*b) & -(*b);
> *b ^= lsb;
> return table[(lsb * magic) >> 58];
> }
> 
> so we'd need to make sure that our solution handles multiple uses of lsb

I'd say we want to start the pattern matching from the table lookup.

I agree match.pd is too expensive unless we have pre-analyzed
initializers in varpool nodes.

Writing the matching of the index as exported (match (...)) function
is still something we can do.  I've for some time wanted the excuse
for somebody to support writing in tree-ssa-math-opts.c for example

/* !start-match
   (match (foo @0 @1)
())
   !end-match */

and a

MATCH_FILES=tree-ssa-math-opts.c

gimple-match-fns.c: MATCH_FILES
extract-parts.sh MATCH_FILES > temp.pd
build/genmatch --gimple --export temp.pd > $<

or something along this lines.  You can then do

if (gimple_foo (ssa_op, &op0, &op1, NULL))
  ...

to do the pattern matching.  Currently all (match ..) functions
are exported in gimple-match.c and that could then change
with --export only done to those used from outside match.pd.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2019-06-12 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #5 from ktkachov at gcc dot gnu.org ---
FWIW, there is another similar function in deepsjeng that computes a
side-effect:
int
myctz2 (unsigned long long * const b) {
 unsigned long long lsb = (*b) & -(*b);
*b ^= lsb;
return table[(lsb * magic) >> 58];
}

so we'd need to make sure that our solution handles multiple uses of lsb

[Bug tree-optimization/90838] Detect table-based ctz implementation

2019-06-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #4 from Jakub Jelinek  ---
I think it looks too complex to do it in match.pd, wouldn't it be better to do
it say in tree-ssa-math-opts.c and thus just once?
That said, due to the x & -x it is really bound to a small number of values
that need to be checked, so shouldn't be that expensive.  One question is if it
should start from the x & -x operation and go through immediate uses to a table
lookup, or if we start from the table lookup and follow index defs to the x &
-x operation, going through a small limited number of casts and *, >> and %
operations (with constant last operands).  Guess we need to also compute
approximate costs of the originally used sequence and compare that to the cost
of ctz (and only do it if there is hw ctz).  And decide if we emit x ? ctz (x)
: const; or something different.  This is GIMPLE, so __builtin_ctz* is
undefined at 0 unconditionally, it is up to the RTL/backends then to optimize
the conditionally called ctz into unconditional one if the value of ctz (0) in
hw is equal or related to the chosen value.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2019-06-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener  ---
I think if we start to pattern-match this kind of stuff we have to be careful
to do it once only and for all table-based stuff so we can classify
each used table and not need to re-scan it for each and every use. 
Alternatively
we need to de-couple the table classification and stick the classification
result somewhere (the varpool node for example) so we can "easily" re-use it.

The first approach means it would need to be a IPA pass.

What about code that has the table in an automatic variable

  [const] int table[64] = { ... };

?  That might end up inline expanded from gimplification, copied
from a .LC0 constant or whatnot.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2019-06-11 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Jakub Jelinek from comment #1)
> __builtin_ctzll is undefined for 0, while the above is well defined and
> returns 0.

When the target ctz is well defined and returns 64 for 0, and we want to return
0 instead, this will work:

__builtin_ctzll (b) & 63

[Bug tree-optimization/90838] Detect table-based ctz implementation

2019-06-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek  ---
__builtin_ctzll is undefined for 0, while the above is well defined and returns
0.

https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightModLookup
has two different table lookups for 32-bit ctz. 
https://doc.lagout.org/security/Hackers%20Delight.pdf has others.  So, if we
want to pattern discover this, it might be best to just look for x & -x and
some arithmetics on that to turn it into a table lookup index and then just for
all the possible values of x & -x (i.e. bitsize(x) + 1) compute the arithmetics
(multiplication, modulo, shift, ...) and check the table content at that spot,
whether it matches the corresponding ctz value, or remember the value for 0.