[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2021-08-02 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #7 from Andrew Pinski  ---
Created attachment 51248
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51248&action=edit
patch for the eq issue

This is the patch for the eq issue.

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2021-08-02 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #6 from Andrew Pinski  ---
(In reply to Uroš Bizjak from comment #2)
> A couple of *scc_bt patterns are missing. These are similar to already
> existing *jcc_bt patterns. Combine wants:
> 
> Failed to match this instruction:
> (set (reg:QI 97)
>     (eq:QI (zero_extract:SI (reg/v:SI 91 [ x ])
>             (const_int 1 [0x1])
>             (zero_extend:SI (subreg:QI (reg/v:SI 92 [ bit ]) 0)))
>         (const_int 0 [0])))

I have a patch which changes that as I was running into something similar a few
days ago.

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #5 from Uroš Bizjak  ---
(In reply to Peter Cordes from comment #4)
> (In reply to Uroš Bizjak from comment #2)
> > A couple of *scc_bt patterns are missing. These are similar to already
> > existing *jcc_bt patterns. Combine wants:
> 
> Does gcc also need patterns for bt + cmovcc?

Actually, my setcc bt patch interferes with cmovcc. Without the patch, gcc
manages to create:

sarl %cl, %edi
andl $1, %edi
cmovne %edx, %eax

and when patched, gcc creates a bt insn and the following asm results:

btl %esi, %edi
setc %dl
testl %edx, %edx
cmovne %ecx, %eax
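
The setcc/test/cmov sequence materializes CF into a register only to re-test
it.  If cmov could consume CF from bt directly (the bt + cmovcc patterns Peter
asks about above), a sketch of the ideal output, assuming the same operand
assignment as the unpatched code, would be:

btl     %esi, %edi
cmovc   %edx, %eax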

I'll revert the patch.

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #4 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #2)
> A couple of *scc_bt patterns are missing. These are similar to already
> existing *jcc_bt patterns. Combine wants:

Does gcc also need patterns for bt + cmovcc?

Thinking about this again: with an immediate count <= 31, it might be best to
use test $0x0100, %edi / setz %al.  BT might be shorter, needing only an imm8
instead of an imm32, but TEST can run on more ports than BT on Intel.  (Ryzen
has 4-per-clock bt throughput.)
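
As a concrete size comparison, using bit 13 and the standard x86 encodings:

testl   $0x2000, %edi     # F7 C7 00 20 00 00   (6 bytes: imm32)
btl     $13, %edi         # 0F BA E7 0D         (4 bytes: imm8)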

(For some registers, TEST can check the low 8 or high 8 bits using an imm8, but
high-8 access can have extra latency on HSW/SKL:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to.
test $imm8, %al is only 2 bytes, or 3 bytes for a low-8 register other than AL
when no REX prefix is needed.  There's no test $imm8_sign_extended, r32/r64
form, so testing the low byte of edi/esi/ebp requires a REX prefix.)

But for a variable count, BT is likely the best bet, even when booleanizing
with setcc.  At least if we avoid `movzx`: bt/setcc/movzx is significantly
worse than xor-zero / bt / setcc, both for latency and because of the false
dependency on the destination register.
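
Concretely, the two booleanizing sequences (a sketch; the xor-zeroing has to
come first, since xor clobbers EFLAGS and bt writes CF):

# worse: setc merges into the old value of %al, and movzx adds latency
bt      %esi, %edi
setc    %al
movzbl  %al, %eax

# better: dep-breaking zeroing idiom, then setc writes into a known-zero reg
xorl    %eax, %eax
bt      %esi, %edi
setc    %al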

With a constant count, SHR / AND is very good if we don't need to invert the
boolean, and it's ok to destroy the source register.  (Or of course just SHR if
we want the high bit).  If adding new BT/SETCC patterns, I guess we need to
make sure gcc still uses SHR or SHR/AND where appropriate.
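
For example, for  return (x >> 13) & 1;  with x in %edi (SysV ABI), the
straightforward sequence is something like:

movl    %edi, %eax
shrl    $13, %eax
andl    $1, %eax
ret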

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #3 from Peter Cordes  ---
Oops, BT sets CF, not ZF.  So

bt  $13, %edi
setnc   %al        # aka setae
ret

This is what clang does for the bt_ functions, and might be optimal for many
use-cases.  (For branching with an immediate, test/jcc is of course better
because it can macro-fuse into a test+branch uop on Intel and AMD.)
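
For example (the label name is illustrative):

testl   $0x2000, %edi     # 1 << 13
jz      .Lbit_clear       # test+jz can macro-fuse into a single uop

whereas a bt $13, %edi / jnc .Lbit_clear pair cannot fuse.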

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #2 from Uroš Bizjak  ---
(In reply to Peter Cordes from comment #0)

> Related:
> 
> bool bt_unsigned(unsigned x, unsigned bit) {
> //bit = 13;
> return !(x & (1<<bit));
> }
> 
> movl    %esi, %ecx
> movl    $1, %eax
> sall    %cl, %eax
> testl   %edi, %eax
> sete    %al
> ret
> 
> This is weird.  The code generated with  1U << bit  is like the bt_signed
> code above and has identical results, so gcc should emit whatever is optimal
> for both cases.  There are similar differences on ARM32.
> 
> (With a fixed count, it just makes the difference between NOT vs. XOR $1.)
> 
> If we're going to use setcc, it's definitely *much* better to use  bt 
> instead of a variable-count shift + test.
> 
> bt  %esi, %edi
> setz    %al
> ret

A couple of *scc_bt patterns are missing. These are similar to already existing
*jcc_bt patterns. Combine wants:

Failed to match this instruction:
(set (reg:QI 97)
    (eq:QI (zero_extract:SI (reg/v:SI 91 [ x ])
            (const_int 1 [0x1])
            (zero_extend:SI (subreg:QI (reg/v:SI 92 [ bit ]) 0)))
        (const_int 0 [0])))
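
In C terms, the zero_extract above is a one-bit field of x at position bit, so
the whole pattern has this shape (illustrative function name):

/* (eq (zero_extract x 1 bit) 0)  is  ((x >> bit) & 1) == 0 */
bool scc_bt_shape(unsigned x, unsigned bit)
{
    return ((x >> bit) & 1) == 0;
}

which is exactly what bt + setnc computes.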

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #1 from Peter Cordes  ---
More generally, you can flip a higher bit while copying with

lea  64(%rdi), %eax

That leaves the bits above that position munged by carry-out, but that isn't
always a problem.
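
A sketch of both cases, with illustrative function names and the codegen this
report is asking for:

/* Low bit, from the summary: (x + 1) & 1 == (~x) & 1, because the
   carry-out of the add only reaches bits that the AND masks off.  */
unsigned flip_low(unsigned x) { return ~x & 1; }
/*   leal    1(%rdi), %eax
     andl    $1, %eax
     ret                       */

/* Bit 6: adding 64 gets no carry-in from bits 0-5, so bit 6 always
   flips; bits 7 and up are munged and must be masked or ignored.  */
unsigned flip_bit6(unsigned x) { return (x ^ 64) & 127; }
/*   leal    64(%rdi), %eax
     andl    $127, %eax
     ret                       */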