[Bug middle-end/97738] Optimizing division by value & - value for HAKMEM 175
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

--- Comment #7 from Andrew Pinski ---
Note we can do without the y&-y only if we keep track of the popcount of the SSA_NAME. But we don't have that yet.
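To make the popcount remark concrete, here is a minimal sketch, assuming a 32-bit unsigned int and the GCC/Clang builtins __builtin_popcount and __builtin_ctz. The helper names are made up for this note and are not GCC internals. It only illustrates that a value with exactly one bit set is its own lowest set bit, so no masking is required before the shift.

```c
#include <assert.h>

/* Nonzero when y has exactly one bit set, i.e. y is a power of two. */
int has_single_bit (unsigned y)
{
  return __builtin_popcount (y) == 1;
}

/* Hypothetical helper: x / y for a y already known to be a power of two.
   No y & -y masking is needed because y & -y == y in that case. */
unsigned div_known_single_bit (unsigned x, unsigned y)
{
  assert (has_single_bit (y));
  return x >> __builtin_ctz (y);
}
```

For example, div_known_single_bit (40, 8) yields 5, the same as 40 / 8.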
[Bug middle-end/97738] Optimizing division by value & - value for HAKMEM 175
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

Andrew Pinski changed:

           What    |Removed     |Added
----------------------------------------------
     Ever confirmed|0           |1
   Last reconfirmed|            |2021-09-26
           Severity|normal      |enhancement
             Status|UNCONFIRMED |NEW

--- Comment #6 from Andrew Pinski ---
Confirmed.
I think x/(y&-y) should be expanded as x >> ctz (y&-y) + 1 (if ctz is an opcode), but this should be done only at expand time (unless we get a "lower" gimple phase).
[Bug middle-end/97738] Optimizing division by value & - value for HAKMEM 175
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

--- Comment #5 from Thomas Koenig ---
(In reply to Jakub Jelinek from comment #4)
> What about a version that still sets lowest_bit to value & -value; rather
> than 1 << ctz?

I think this would be ideal, or close to it.

> Also, I'm not sure you can safely do the (changed_bits >> ctz) >> 2 to
> changed_bits >> (ctz + 2) transformation, while because of the division one
> can count on value not being 0 (otherwise UB), value & -value can still be
> e.g. 1U << 31 and then ctz 31 too, and changed_bits >> (31 + 2) being UB,
> while
> (changed_bits >> 31) >> 2 well defined returning 0.

OK.

> So, I think we could e.g. during expansion (or isel) based on target cost
> optimize
> x / (y & -y) to x >> __builtin_ctz (y) (also assuming the optab for ctz
> exists), but anything else looks complicated.

I think this would solve the issue for the original code (which is what people will find on the web if they google for HAKMEM 175).
[Bug middle-end/97738] Optimizing division by value & - value for HAKMEM 175
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

Jakub Jelinek changed:

           What    |Removed     |Added
----------------------------------------------
                 CC|            |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek ---
What about a version that still sets lowest_bit to value & -value; rather than 1 << ctz?

Also, I'm not sure you can safely do the (changed_bits >> ctz) >> 2 to changed_bits >> (ctz + 2) transformation. While because of the division one can count on value not being 0 (otherwise UB), value & -value can still be e.g. 1U << 31 and then ctz is 31 too, with changed_bits >> (31 + 2) being UB, while (changed_bits >> 31) >> 2 is well defined, returning 0.

So, I think we could e.g. during expansion (or isel) based on target cost optimize x / (y & -y) to x >> __builtin_ctz (y) (also assuming the optab for ctz exists), but anything else looks complicated.
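As a rough illustration of the x / (y & -y) to x >> __builtin_ctz (y) rewrite, the sketch below writes both forms side by side for unsigned operands. It assumes a 32-bit unsigned int and the GCC/Clang builtin; the function names are invented for this note. Both forms require y to be nonzero: the division would otherwise divide by zero, and __builtin_ctz (0) is undefined.

```c
#include <assert.h>

/* Hypothetical helper: the division form, x / (y & -y).  y must be nonzero. */
unsigned div_by_lowest_bit (unsigned x, unsigned y)
{
  assert (y != 0);
  return x / (y & -y);
}

/* Hypothetical helper: the shift form.  The count is at most 31 for a
   32-bit unsigned int, so the shift itself is always well defined. */
unsigned shift_by_ctz (unsigned x, unsigned y)
{
  assert (y != 0);
  return x >> __builtin_ctz (y);
}
```

With x = 40 and y = 12, the lowest set bit of y is 4 and both helpers return 10. With y = 1U << 31 the shift count is 31, which is still in range for a single shift.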
[Bug middle-end/97738] Optimizing division by value & - value for HAKMEM 175
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

--- Comment #3 from Thomas Koenig ---
Even faster code:

  ctz = __builtin_ctz (value);
  lowest_bit = value & - value;
  left_bits = value + lowest_bit;
  changed_bits = value ^ left_bits;
  right_bits = changed_bits >> (ctz + 2);
  return left_bits | right_bits;

The first two instructions get compiled directly (with -march=native) to

        blsi    %edi, %edx
        tzcntl  %edi, %eax
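For reference, the statements above can be wrapped into a complete function along the following lines. This is a sketch, not the code from the report: the function name and the unsigned types are assumptions, and the final shift is deliberately kept as two shifts, following the concern from comment #4, so that no single shift count reaches 32 when ctz is 30 or 31.

```c
#include <assert.h>

/* Next larger value with the same number of set bits (HAKMEM 175 style).
   value must be nonzero because __builtin_ctz (0) is undefined. */
unsigned next_same_popcount (unsigned value)
{
  assert (value != 0);
  unsigned ctz = (unsigned) __builtin_ctz (value);
  unsigned lowest_bit = value & -value;          /* isolate lowest set bit */
  unsigned left_bits = value + lowest_bit;       /* carry ripples through the low run of ones */
  unsigned changed_bits = value ^ left_bits;     /* bits flipped by the carry */
  unsigned right_bits = (changed_bits >> ctz) >> 2;  /* two shifts, each count < 32 */
  return left_bits | right_bits;
}
```

Starting from 3 (binary 0011), repeated calls give 5, 6, 9, 10, 12, each with two bits set.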
[Bug middle-end/97738] Optimizing division by value & - value for HAKMEM 175
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

--- Comment #2 from Thomas Koenig ---
Created attachment 49516
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49516&action=edit
Small benchmark

Here's a small benchmark for counting all 32-bit numbers with 16 bits set according to the HAKMEM source.

Timing is (first float is elapsed time in seconds for version with division, second float is for the shift):

  2.319526 601080391
  1.147284 601080391

with -O3 -march=native on an AMD Ryzen 7 1700X,

  4.539288 601080391
  2.700514 601080391

on POWER9.
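The attachment itself is not reproduced here. As a hedged sketch of what such a counting loop might look like, the code below uses the division form of the step and counts patterns for a smaller width so it finishes instantly. The function names, the width and bits parameters, and the loop structure are all assumptions made for this note, not the contents of attachment 49516, and no timing code is included.

```c
#include <assert.h>

/* Division form of the step, as in the usual HAKMEM 175 write-ups.
   value must be nonzero, otherwise lowest_bit is 0 and the division is UB. */
unsigned next_same_popcount_div (unsigned value)
{
  assert (value != 0);
  unsigned lowest_bit = value & -value;
  unsigned left_bits = value + lowest_bit;
  unsigned changed_bits = value ^ left_bits;
  unsigned right_bits = (changed_bits / lowest_bit) >> 2;
  return left_bits | right_bits;
}

/* Count the values below 2^width that have exactly `bits` bits set,
   stepping from the smallest such value upward. */
unsigned long count_patterns (unsigned width, unsigned bits)
{
  assert (bits > 0 && bits <= width && width < 32);
  unsigned limit = 1u << width;
  unsigned v = (1u << bits) - 1u;   /* smallest value with `bits` ones */
  unsigned long n = 0;
  while (v < limit)
    {
      n++;
      v = next_same_popcount_div (v);
    }
  return n;
}
```

For instance, count_patterns (4, 2) visits 3, 5, 6, 9, 10 and 12 and returns 6. Swapping the division for a shift by the trailing-zero count gives the variant that the second timing refers to.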
[Bug middle-end/97738] Optimizing division by value & - value for HAKMEM 175
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

Richard Biener changed:

           What    |Removed          |Added
----------------------------------------------------
           Keywords|                 |missed-optimization
          Component|rtl-optimization |middle-end

--- Comment #1 from Richard Biener ---
OK, I guess that we should look at replacing x / y, with y known to have exactly one bit set, by x >> (ctz (y) + 1). Note I'm quite sure this isn't faster for all power-of-two y. It's also not canonically simpler. In the end it is something for instruction selection / RTL expansion, I guess.