[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #7 from Andrew Pinski ---
Created attachment 51248
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51248&action=edit
  patch for the eq issue

This is the patch for the eq issue.
--- Comment #6 from Andrew Pinski ---
(In reply to Uroš Bizjak from comment #2)
> A couple of *scc_bt patterns are missing. These are similar to already
> existing *jcc_bt patterns. Combine wants:
>
> Failed to match this instruction:
> (set (reg:QI 97)
>      (eq:QI (zero_extract:SI (reg/v:SI 91 [ x ])
>                 (const_int 1 [0x1])
>                 (zero_extend:SI (subreg:QI (reg/v:SI 92 [ bit ]) 0)))
>          (const_int 0 [0])))

I have a patch which changes that, as I was running into something similar a few days ago.
--- Comment #5 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #4)
> (In reply to Uroš Bizjak from comment #2)
> > A couple of *scc_bt patterns are missing. These are similar to already
> > existing *jcc_bt patterns. Combine wants:
>
> Does gcc also need patterns for bt + cmovcc?

Actually, my setcc bt patch interferes with cmovcc. Without the patch, gcc manages to create:

        sarl    %cl, %edi
        andl    $1, %edi
        cmovne  %edx, %eax

With the patch, gcc creates a bt insn, and the following asm is generated:

        btl     %esi, %edi
        setc    %dl
        testl   %edx, %edx
        cmovne  %ecx, %eax

I'll revert the patch.
--- Comment #4 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #2)
> A couple of *scc_bt patterns are missing. These are similar to already
> existing *jcc_bt patterns. Combine wants:

Does gcc also need patterns for bt + cmovcc?

Thinking about this again, with an immediate count <= 31 it might be best to use test $0x0100, %edi / setz %al. BT might be shorter, needing only an imm8 instead of an imm32, but TEST can run on more ports than BT on Intel. (Ryzen has 4-per-clock bt throughput.)

(In some registers, TEST can check the low8 or high8 using an imm8, but high8 can have extra latency on HSW/SKL: https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to. And test $imm8, %al is only 2 bytes, or 3 bytes for a low8 register other than AL if a REX prefix isn't needed. There's no test $imm8_sign_extended, r32/r64, so you need a REX prefix to test the low byte of edi/esi/ebp.)

But for a variable count, BT is likely the best bet, even when booleanizing with setcc. At least if we avoid movzx: bt/setcc/movzx is significantly worse than xor-zero / bt / setcc, both for latency and because of the false dependency on the destination register.

With a constant count, SHR / AND is very good if we don't need to invert the boolean and it's ok to destroy the source register. (Or of course just SHR if we want the high bit.)

If adding new BT/SETCC patterns, I guess we need to make sure gcc still uses SHR or SHR/AND where appropriate.
--- Comment #3 from Peter Cordes ---
Oops, BT sets CF, not ZF. So:

        bt      $13, %edi
        setnc   %al             # aka setae
        ret

This is what clang does for the bt_ functions, and might be optimal for many use-cases. (For branching with an immediate, test/jcc is of course better because it can macro-fuse into a test+branch uop on Intel and AMD.)
--- Comment #2 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #0)
> Related:
>
> bool bt_unsigned(unsigned x, unsigned bit) {
>     //bit = 13;
>     return !(x & (1 << bit));
> }
>
>         movl    %esi, %ecx
>         movl    $1, %eax
>         sall    %cl, %eax
>         testl   %edi, %eax
>         sete    %al
>         ret
>
> This is weird. The code generated with 1U << bit is like the bt_signed
> code above and has identical results, so gcc should emit whatever is
> optimal for both cases. There are similar differences on ARM32.
>
> (With a fixed count, it just makes the difference between NOT vs. XOR $1.)
>
> If we're going to use setcc, it's definitely *much* better to use bt
> instead of a variable-count shift + test:
>
>         bt      %esi, %edi
>         setz    %al
>         ret

A couple of *scc_bt patterns are missing. These are similar to already existing *jcc_bt patterns. Combine wants:

Failed to match this instruction:
(set (reg:QI 97)
     (eq:QI (zero_extract:SI (reg/v:SI 91 [ x ])
                (const_int 1 [0x1])
                (zero_extend:SI (subreg:QI (reg/v:SI 92 [ bit ]) 0)))
         (const_int 0 [0])))
--- Comment #1 from Peter Cordes ---
More generally, you can flip a higher bit while copying with:

        lea     64(%rdi), %eax

That leaves the bits above that position munged by carry-out, but that isn't always a problem.