[Bug middle-end/97738] Optimizing division by value & - value for HAKMEM 175

2021-09-26 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

--- Comment #7 from Andrew Pinski  ---
Note that we would not need to match y&-y specifically if we kept track of the
popcount of the SSA_NAME, but we don't have that yet.

2021-09-26 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-09-26
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW

--- Comment #6 from Andrew Pinski  ---
Confirmed.


I think x/(y&-y) should be expanded as x >> ctz (y) (if ctz is an opcode;
note that ctz (y&-y) == ctz (y)), but this should be done only at expand time
(unless we get a "lower" gimple phase).
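
As a minimal sketch of the equivalence discussed here, assuming unsigned 32-bit
operands and the GCC/Clang builtin __builtin_ctz. The helper names
div_by_lowest_bit and shift_by_ctz are made up for illustration and do not
come from the bug report:

```c
#include <assert.h>
#include <stdint.h>

/* Reference form: divide by the lowest set bit of y (y must be nonzero). */
static uint32_t div_by_lowest_bit(uint32_t x, uint32_t y)
{
    assert(y != 0);            /* y == 0 would mean a division by zero */
    return x / (y & -y);       /* y & -y isolates the lowest set bit */
}

/* Shift form: y & -y == 1u << ctz(y), so the division is a right shift. */
static uint32_t shift_by_ctz(uint32_t x, uint32_t y)
{
    assert(y != 0);            /* __builtin_ctz(0) is undefined */
    return x >> __builtin_ctz(y);
}
```

The shift count is at most 31 here, so the shift itself is always in range.
The sketch only covers unsigned x.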

2020-11-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

--- Comment #5 from Thomas Koenig  ---
(In reply to Jakub Jelinek from comment #4)
> What about a version that still sets lowest_bit to value & -value; rather
> than 1 << ctz?

I think this would be ideal, or close to it.

> Also, I'm not sure you can safely transform (changed_bits >> ctz) >> 2 into
> changed_bits >> (ctz + 2). Because of the division one can count on value
> not being 0 (otherwise UB), but value & -value can still be e.g. 1U << 31,
> so ctz is 31 too; then changed_bits >> (31 + 2) is UB, while
> (changed_bits >> 31) >> 2 is well defined and returns 0.

OK.

> So, I think we could e.g. during expansion (or isel), based on target cost,
> optimize x / (y & -y) to x >> __builtin_ctz (y) (also assuming the optab for
> ctz exists), but anything else looks complicated.

I think this would solve the issue for the original code (which is
what people will find on the web if they google for HAKMEM 175).

2020-11-06 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek  ---
What about a version that still sets lowest_bit to value & -value; rather than
1 << ctz?
Also, I'm not sure you can safely transform (changed_bits >> ctz) >> 2 into
changed_bits >> (ctz + 2). Because of the division one can count on value
not being 0 (otherwise UB), but value & -value can still be e.g. 1U << 31,
so ctz is 31 too; then changed_bits >> (31 + 2) is UB, while
(changed_bits >> 31) >> 2 is well defined and returns 0.

So, I think we could e.g. during expansion (or isel), based on target cost,
optimize x / (y & -y) to x >> __builtin_ctz (y) (also assuming the optab for
ctz exists), but anything else looks complicated.
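
The shift-range concern can be sketched as follows, assuming a 32-bit unsigned
type and the GCC/Clang builtin __builtin_ctz. shift_two_step is a hypothetical
helper name used only for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Shift right by ctz and then by 2, as two separate shifts.
   Both counts stay below 32, so this is well defined for every nonzero
   value, unlike a single shift by (ctz + 2), which is undefined in C
   once ctz >= 30 on a 32-bit type. */
static uint32_t shift_two_step(uint32_t changed_bits, uint32_t value)
{
    assert(value != 0);                       /* __builtin_ctz(0) is undefined */
    unsigned ctz = (unsigned)__builtin_ctz(value);
    return (changed_bits >> ctz) >> 2;
}
```

With value == 1U << 31 the two-step form yields 0 for any changed_bits, which
is the well-defined result mentioned above.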

2020-11-06 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

--- Comment #3 from Thomas Koenig  ---
Even faster code:

  ctz = __builtin_ctz (value);
  lowest_bit = value & - value;
  left_bits = value + lowest_bit;
  changed_bits = value ^ left_bits;
  right_bits = changed_bits >> (ctz + 2);
  return left_bits | right_bits;

The first two statements get compiled directly (with -march=native) to

blsi    %edi, %edx
tzcntl  %edi, %eax
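
For reference, a self-contained sketch of the fragment above as a complete
function, assuming 32-bit unsigned values and the GCC/Clang builtin
__builtin_ctz. The function name next_same_popcount is an assumption made for
this example, and the final shift is written in two steps so that the count
never reaches the type width:

```c
#include <assert.h>
#include <stdint.h>

/* Next larger value with the same number of set bits, ctz-based variant.
   The result wraps modulo 2^32 when no larger 32-bit value exists. */
static uint32_t next_same_popcount(uint32_t value)
{
    assert(value != 0);                           /* ctz(0) is undefined */
    unsigned ctz = (unsigned)__builtin_ctz(value);
    uint32_t lowest_bit = value & -value;         /* lowest set bit */
    uint32_t left_bits = value + lowest_bit;      /* carry into next 0 bit */
    uint32_t changed_bits = value ^ left_bits;    /* bits that flipped */
    uint32_t right_bits = (changed_bits >> ctz) >> 2;  /* two in-range shifts */
    return left_bits | right_bits;
}
```

For example, starting from 3 (binary 0011) the successive results are 5, 6
and 9.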

2020-11-06 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

--- Comment #2 from Thomas Koenig  ---
Created attachment 49516
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49516&action=edit
Small benchmark

Here's a small benchmark for counting all 32-bit numbers with 16 bits set
according to the HAKMEM source.

Timings (each line shows the elapsed time in seconds followed by the resulting
count; in each pair the first line is the version with division, the second is
the version with the shift):

2.319526 601080391
1.147284 601080391

with -O3 -march=native on an AMD Ryzen 7 1700X,

4.539288 601080391
2.700514 601080391

on POWER9.
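
The attachment itself is not reproduced in this thread. As a rough sketch of
the kind of loop being timed, assuming the division form of the step and a
reduced bit width so that it runs instantly: next_div and count_k_bits are
hypothetical names, and the real benchmark may be structured differently.

```c
#include <assert.h>
#include <stdint.h>

/* Division form: next larger value with the same number of set bits. */
static uint32_t next_div(uint32_t value)
{
    assert(value != 0);                        /* avoids division by zero */
    uint32_t lowest_bit = value & -value;
    uint32_t left_bits = value + lowest_bit;
    uint32_t changed_bits = value ^ left_bits;
    uint32_t right_bits = (changed_bits / lowest_bit) >> 2;
    return left_bits | right_bits;
}

/* Count the values below 2^width that have exactly k bits set, by stepping
   from the smallest such value.  Assumes 0 < k <= width < 32. */
static uint32_t count_k_bits(unsigned width, unsigned k)
{
    assert(k > 0 && k <= width && width < 32);
    uint32_t limit = UINT32_C(1) << width;
    uint32_t count = 0;
    for (uint32_t v = (UINT32_C(1) << k) - 1; v < limit; v = next_div(v))
        count++;
    return count;
}
```

Calling count_k_bits (8, 4) visits the 70 eight-bit values with four bits set.
Swapping the division for a shift by ctz gives the second variant being
compared.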

2020-11-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
          Component|rtl-optimization            |middle-end

--- Comment #1 from Richard Biener  ---
OK, I guess that we should look into replacing

  x / y

with y known to have exactly one bit set by

  x >> ctz (y)

Note I'm quite sure this isn't faster for all power-of-two y.
It's also not canonically simpler.  In the end, something for
instruction selection / RTL expansion, I guess.