[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #21 from Wilco ---
(In reply to Gabriel Ravier from comment #19)
> If the original code being branchless makes it faster, wouldn't that imply
> that we should use the table-based implementation when generating code for
> `__builtin_ctz`?

__builtin_ctz is 3-4 times faster than the table implementation, so this
optimization is always worth it. This is why I believe the current situation
is not ideal, since various targets still set CTZ_DEFINED_VALUE_AT_ZERO to 0
or 1. One option would be to always allow it in GIMPLE (perhaps add an extra
argument for the value to return for a zero input), and at expand time check
whether the backend supports the requested value. If it doesn't, emit
branches.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #20 from Jakub Jelinek ---
No, because __builtin_ctz is branchless too, it just has UB when the argument
is 0.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #19 from Gabriel Ravier ---
(In reply to Jakub Jelinek from comment #14)
> The patch does:
> +  bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) == 2;
> +
> +  /* Skip if there is no value defined at zero, or if we can't easily
> +     return the correct value for zero.  */
> +  if (!zero_ok)
> +    return false;
> +  if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
> +    return false;
> For CTZ_DEFINED_VALUE_AT_ZERO == 1 we could support it the same way but
> we'd need to emit into the IL an equivalent of
> val == 0 ? zero_val : .CTZ (val) (with GIMPLE_COND and a separate bb - not
> sure if anything in forwprop creates new basic blocks right now), where
> there is a high chance that RTL opts would turn it back into unconditional
> ctz.
> That still wouldn't help non--mbmi x86, because CTZ_DEFINED_VALUE_AT_ZERO
> is 0 there.
> We could handle even that case by doing the branches around, but those
> would stay there in the generated code, at which point I wonder whether it
> would be a win. The original code is branchless...

If the original code being branchless makes it faster, wouldn't that imply
that we should use the table-based implementation when generating code for
`__builtin_ctz`?
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #18 from Jakub Jelinek ---
It is generally a win for cases where the condition can't be predicted, while
if it can, jumps are much better. We have dozens or hundreds of PRs about
this in either direction on x86.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #17 from Wilco ---
(In reply to Jakub Jelinek from comment #16)
> (In reply to Wilco from comment #15)
> > It would make more sense to move x86 backends to
> > CTZ_DEFINED_VALUE_AT_ZERO == 2 so that you always get the same result
> > even when you don't have tzcnt. A conditional move would be possible, so
> > it adds an extra 2 instructions at worst (ie. still significantly faster
> > than doing the table lookup, multiply etc). And it could be optimized
> > when you know CLZ/CTZ input is non-zero.
> Conditional moves are a lottery on x86, in many cases a very bad idea. And
> when people actually use __builtin_clz*, they state that they don't care
> about the 0 value, so emitting terribly performing code for it just in
> case would be wrong.
> If forwprop emits the conditional in separate blocks for the CTZ_DVAZ != 2
> case, on targets where conditional moves are beneficial for it it can also
> emit them, or emit the jump, which say on x86 will most likely be faster
> than cmov.

Well, GCC emits a cmov for this (-O2 -march=x86-64-v2):

int ctz(long a) { return (a == 0) ? 64 : __builtin_ctzl (a); }

ctz:
        xor     edx, edx
        mov     eax, 64
        rep bsf rdx, rdi
        test    rdi, rdi
        cmovne  eax, edx
        ret

Note the extra 'test' seems redundant since IIRC bsf sets Z=1 if the input is
zero. On Zen 2 this has identical performance to the plain builtin when you
loop it as res = ctz (res) + 1; (ie. measuring latency of the non-zero case).
So I find it hard to believe cmov is expensive on modern cores.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #16 from Jakub Jelinek ---
(In reply to Wilco from comment #15)
> It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO
> == 2 so that you always get the same result even when you don't have tzcnt.
> A conditional move would be possible, so it adds an extra 2 instructions at
> worst (ie. still significantly faster than doing the table lookup, multiply
> etc). And it could be optimized when you know CLZ/CTZ input is non-zero.

Conditional moves are a lottery on x86, in many cases a very bad idea. And
when people actually use __builtin_clz*, they state that they don't care
about the 0 value, so emitting terribly performing code for it just in case
would be wrong.
If forwprop emits the conditional in separate blocks for the CTZ_DVAZ != 2
case, on targets where conditional moves are beneficial for it it can also
emit them, or emit the jump, which say on x86 will most likely be faster than
cmov.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #15 from Wilco ---
(In reply to Jakub Jelinek from comment #14)
> The patch does:
> +  bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) == 2;
> +
> +  /* Skip if there is no value defined at zero, or if we can't easily
> +     return the correct value for zero.  */
> +  if (!zero_ok)
> +    return false;
> +  if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
> +    return false;
> For CTZ_DEFINED_VALUE_AT_ZERO == 1 we could support it the same way but
> we'd need to emit into the IL an equivalent of
> val == 0 ? zero_val : .CTZ (val) (with GIMPLE_COND and a separate bb - not
> sure if anything in forwprop creates new basic blocks right now), where
> there is a high chance that RTL opts would turn it back into unconditional
> ctz.
> That still wouldn't help non--mbmi x86, because CTZ_DEFINED_VALUE_AT_ZERO
> is 0 there.
> We could handle even that case by doing the branches around, but those
> would stay there in the generated code, at which point I wonder whether it
> would be a win. The original code is branchless...

It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO
== 2 so that you always get the same result even when you don't have tzcnt.
A conditional move would be possible, so it adds an extra 2 instructions at
worst (ie. still significantly faster than doing the table lookup, multiply
etc). And it could be optimized when you know CLZ/CTZ input is non-zero.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #14 from Jakub Jelinek ---
The patch does:
+  bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) == 2;
+
+  /* Skip if there is no value defined at zero, or if we can't easily
+     return the correct value for zero.  */
+  if (!zero_ok)
+    return false;
+  if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
+    return false;
For CTZ_DEFINED_VALUE_AT_ZERO == 1 we could support it the same way, but we'd
need to emit into the IL an equivalent of val == 0 ? zero_val : .CTZ (val)
(with GIMPLE_COND and a separate bb - not sure if anything in forwprop
creates new basic blocks right now), where there is a high chance that RTL
opts would turn it back into an unconditional ctz.
That still wouldn't help non--mbmi x86, because CTZ_DEFINED_VALUE_AT_ZERO is
0 there.
We could handle even that case by doing the branches around, but those would
stay there in the generated code, at which point I wonder whether it would be
a win. The original code is branchless...
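The val == 0 ? zero_val : .CTZ (val) shape can be sketched at the C source level; this is an illustrative sketch only (the function name is ours, and forwprop would of course emit GIMPLE, not C):

```c
/* C-level sketch of the guarded-ctz shape described above for the
   CTZ_DEFINED_VALUE_AT_ZERO == 1 case: an explicit zero test around the
   ctz call.  zero_val stands for whatever the matched table stores for a
   zero input; RTL opts may later fold the branch away when the hardware
   ctz value happens to match it.  */
int
guarded_ctz (unsigned val, int zero_val)
{
  /* __builtin_ctz is undefined at 0, so the guard preserves the
     semantics of the original table lookup.  */
  return val == 0 ? zero_val : __builtin_ctz (val);
}
```

For example, with zero_val == 32 this returns 32 for a zero input and the ordinary trailing-zero count otherwise.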
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #13 from Andrew Pinski ---
(In reply to Gabriel Ravier from comment #12)
> It appears this new optimization is non-functional on trunk with x86-64...
> specifically on x86-64, that is; on AArch64 it works just fine. So does
> that mean this bug should be re-opened, or should a new bug be opened for
> that?

It does work with -mbmi, where the instruction which is used has a defined
value at 0. You can open a new issue for improvement of the case of not
supplying -mbmi, that is, where CTZ_DEFINED_VALUE_AT_ZERO is not 2.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Gabriel Ravier changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gabravier at gmail dot com

--- Comment #12 from Gabriel Ravier ---
It appears this new optimization is non-functional on trunk with x86-64...
specifically on x86-64, that is; on AArch64 it works just fine. So does that
mean this bug should be re-opened, or should a new bug be opened for that?
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Tamar Christina changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
                 CC|                            |tnfchris at gcc dot gnu.org
         Resolution|---                         |FIXED
   Target Milestone|---                         |10.0

--- Comment #11 from Tamar Christina ---
Fixed on master.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #10 from CVS Commits ---
The master branch has been updated by Jakub Jelinek:

https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=eb7c7c524556df5364f03adc20f6a9db20858484

commit r10-5912-geb7c7c524556df5364f03adc20f6a9db20858484
Author: Jakub Jelinek
Date:   Mon Jan 13 14:14:57 2020 +0100

    tree-opt: Fix bootstrap failure in tree-ssa-forwprop.c some more [PR90838]

    2020-01-13  Jakub Jelinek

            PR tree-optimization/90838
            * tree-ssa-forwprop.c (simplify_count_trailing_zeroes): Use
            SCALAR_INT_TYPE_MODE directly in CTZ_DEFINED_VALUE_AT_ZERO macro
            argument rather than to initialize a temporary for targets that
            don't use the mode argument at all.  Initialize ctzval to avoid
            warning at -O0.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #9 from Jakub Jelinek ---
Author: jakub
Date: Fri Jan 10 21:10:03 2020
New Revision: 280140

URL: https://gcc.gnu.org/viewcvs?rev=280140&root=gcc&view=rev
Log:
        PR tree-optimization/90838
        * tree-ssa-forwprop.c (simplify_count_trailing_zeroes): Use
        SCALAR_INT_TYPE_MODE instead of TYPE_MODE as operand of
        CTZ_DEFINED_VALUE_AT_ZERO.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #8 from Wilco ---
Author: wilco
Date: Fri Jan 10 19:32:53 2020
New Revision: 280132

URL: https://gcc.gnu.org/viewcvs?rev=280132&root=gcc&view=rev
Log:
PR90838: Support ctz idioms

Support common idioms for count trailing zeroes using an array lookup.
The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
constant which when multiplied by a power of 2 creates a unique value in
the top 5 or 6 bits.  This is then indexed into a table which maps it to
the number of trailing zeroes.  When the table is valid, we emit a
sequence using the target defined value for ctz (0):

int ctz1 (unsigned x)
{
  static const char table[32] =
    {
      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

is optimized to:

        rbit    w0, w0
        clz     w0, w0
        and     w0, w0, 31
        ret

gcc/
        PR tree-optimization/90838
        * tree-ssa-forwprop.c (check_ctz_array): Add new function.
        (check_ctz_string): Likewise.
        (optimize_count_trailing_zeroes): Likewise.
        (simplify_count_trailing_zeroes): Likewise.
        (pass_forwprop::execute): Try ctz simplification.
        * match.pd: Add matching for ctz idioms.

testsuite/
        PR tree-optimization/90838
        * testsuite/gcc.target/aarch64/pr90838.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/aarch64/pr90838.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c
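As a sanity check of the idiom in the commit message, the helper below (our own addition, not part of the commit) verifies that ctz1 agrees with __builtin_ctz for every bit position:

```c
/* ctz1 is copied from the commit message above; check_ctz1 is our own
   sanity check that the de Bruijn table lookup agrees with
   __builtin_ctz for all 32 bit positions.  */
int
ctz1 (unsigned x)
{
  static const char table[32] =
    {
      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

int
check_ctz1 (void)
{
  for (unsigned i = 0; i < 32; i++)
    if (ctz1 (1u << i) != (int) __builtin_ctz (1u << i))
      return 0;
  /* x & -x isolates the lowest set bit, so inputs with several set bits
     must work too: 0xF0 has four trailing zeroes.  */
  return ctz1 (0xF0u) == 4;
}
```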
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2019-08-09
           Assignee|unassigned at gcc dot gnu.org|wilco at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #7 from Wilco ---
I'll have a look at this. I think it could easily be done in match.pd if we
add support for matching array references.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #6 from rguenther at suse dot de ---
On Wed, 12 Jun 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838
>
> --- Comment #5 from ktkachov at gcc dot gnu.org ---
> FWIW, there is another similar function in deepsjeng that computes a
> side-effect:
>
> int
> myctz2 (unsigned long long * const b) {
>     unsigned long long lsb = (*b) & -(*b);
>     *b ^= lsb;
>     return table[(lsb * magic) >> 58];
> }
>
> so we'd need to make sure that our solution handles multiple uses of lsb

I'd say we want to start the pattern matching from the table lookup. I agree
match.pd is too expensive unless we have pre-analyzed initializers in varpool
nodes. Writing the matching of the index as an exported (match (...))
function is still something we can do. I've for some time wanted the excuse
for somebody to support writing in tree-ssa-math-opts.c for example

/* !start-match
   (match (foo @0 @1)
    ())
   !end-match */

and a

MATCH_FILES=tree-ssa-math-opts.c
gimple-match-fns.c: MATCH_FILES
        extract-parts.sh MATCH_FILES > temp.pd
        build/genmatch --gimple --export temp.pd > $<

or something along these lines. You can then do

  if (gimple_foo (ssa_op, &op0, &op1, NULL))
    ...

to do the pattern matching. Currently all (match ..) functions are exported
in gimple-match.c, and that could then change with --export only done to
those used from outside match.pd.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #5 from ktkachov at gcc dot gnu.org ---
FWIW, there is another similar function in deepsjeng that computes a
side-effect:

int
myctz2 (unsigned long long * const b) {
    unsigned long long lsb = (*b) & -(*b);
    *b ^= lsb;
    return table[(lsb * magic) >> 58];
}

so we'd need to make sure that our solution handles multiple uses of lsb.
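For illustration, the transformed version of myctz2 would presumably look like the following. This is a sketch under the assumption that *b is non-zero on entry (since __builtin_ctzll is undefined at 0, as comment #1 notes); the name myctz2_opt is ours:

```c
/* Sketch of what an optimized myctz2 could look like: the table lookup
   replaced by __builtin_ctzll, with the side effect of clearing the
   lowest set bit kept.  Assumes *b != 0 on entry.  */
int
myctz2_opt (unsigned long long *const b)
{
  unsigned long long lsb = *b & -*b;   /* isolate the lowest set bit */
  *b ^= lsb;                           /* clear it, as the original does */
  return __builtin_ctzll (lsb);        /* same count as ctz of the old *b */
}
```

For example, starting from *b == 12 (binary 1100) this returns 2 and leaves *b == 8, matching the table-based original.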
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #4 from Jakub Jelinek ---
I think it looks too complex to do it in match.pd; wouldn't it be better to
do it say in tree-ssa-math-opts.c and thus just once?

That said, due to the x & -x it is really bound to a small number of values
that need to be checked, so it shouldn't be that expensive. One question is
if it should start from the x & -x operation and go through immediate uses to
a table lookup, or if we start from the table lookup and follow index defs to
the x & -x operation, going through a small limited number of casts and *, >>
and % operations (with constant last operands).

Guess we need to also compute approximate costs of the originally used
sequence and compare that to the cost of ctz (and only do it if there is hw
ctz). And decide if we emit x ? ctz (x) : const; or something different.
This is GIMPLE, so __builtin_ctz* is undefined at 0 unconditionally; it is up
to the RTL/backends then to optimize the conditionally called ctz into an
unconditional one if the value of ctz (0) in hw is equal or related to the
chosen value.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
I think if we start to pattern-match this kind of stuff we have to be careful
to do it once only and for all table-based stuff, so we can classify each
used table and not need to re-scan it for each and every use. Alternatively
we need to de-couple the table classification and stick the classification
result somewhere (the varpool node for example) so we can "easily" re-use it.
The first approach means it would need to be an IPA pass.

What about code that has the table in an automatic variable

  [const] int table[64] = { ... };

? That might end up inline expanded from gimplification, copied from a .LC0
constant or whatnot.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
(In reply to Jakub Jelinek from comment #1)
> __builtin_ctzll is undefined for 0, while the above is well defined and
> returns 0.

When the target ctz is well defined and returns 64 for 0, and we want to
return 0 instead, this will work:

  __builtin_ctzll (b) & 63
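A small model of this masking trick (the helper names are ours; hw_ctz64 emulates in portable C a target whose ctz is defined to return 64 at zero, as the AArch64 rbit+clz sequence effectively does):

```c
/* hw_ctz64 models a target ctz that is defined to return 64 for a zero
   input; masking the result with 63 then folds that 64 back to the 0
   the table-based code returns, without any branch in the final code on
   such a target.  */
int
hw_ctz64 (unsigned long long x)
{
  /* The ternary is only needed in this portable emulation, because
     __builtin_ctzll itself is undefined at 0.  */
  return x ? __builtin_ctzll (x) : 64;
}

int
ctz_or_zero (unsigned long long x)
{
  return hw_ctz64 (x) & 63;   /* 64 & 63 == 0; values 0..63 unchanged */
}
```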
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek ---
__builtin_ctzll is undefined for 0, while the above is well defined and
returns 0.
https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightModLookup
has two different table lookups for 32-bit ctz.
https://doc.lagout.org/security/Hackers%20Delight.pdf has others.
So, if we want to discover this pattern, it might be best to just look for
x & -x and some arithmetic on it that turns it into a table lookup index,
and then for all the possible values of x & -x (i.e. bitsize(x) + 1 of them)
compute the arithmetic (multiplication, modulo, shift, ...) and check whether
the table content at that spot matches the corresponding ctz value, or
remember the value for 0.
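The verification described above can be sketched as a small loop over the possible x & -x values. This is an illustrative reimplementation of the idea (not GCC's actual check_ctz_array), using the 32-bit de Bruijn table/constant pair that appears in comment #8; the function name is ours:

```c
/* For each possible non-zero value of x & -x (a power of two), compute
   the index the idiom array[((x & -x) * C) >> SHIFT] would use and
   check that the table entry matches the trailing-zero count.  If every
   entry matches, the lookup is a ctz and can be replaced by one.  */
static const char ctz_table[32] =
  {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
  };

int
table_implements_ctz (void)
{
  for (int i = 0; i < 32; i++)
    {
      unsigned lsb = 1u << i;                    /* candidate x & -x value */
      unsigned idx = (lsb * 0x077CB531u) >> 27;  /* index the idiom computes */
      if (ctz_table[idx] != i)
        return 0;   /* table is not a ctz mapping */
    }
  return 1;         /* every power of two maps to its ctz */
}
```

A real implementation would also record what the arithmetic yields for the x == 0 input, since that determines which CTZ_DEFINED_VALUE_AT_ZERO behavior is needed.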