[Bug rtl-optimization/46235] inefficient bittest code generation

2022-03-22 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235
Bug 46235 depends on bug 104982, which changed state.

Bug 104982 Summary: [12 Regression] FAIL: gcc.target/i386/bt-5.c by r12-7687
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104982

           What      |Removed        |Added
----------------------------------------------------------------------------
           Status    |NEW            |RESOLVED
           Resolution|---            |FIXED

[Bug rtl-optimization/46235] inefficient bittest code generation

2021-06-20 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

Roger Sayle  changed:

           What      |Removed        |Added
----------------------------------------------------------------------------
           CC        |               |roger at nextmovesoftware dot com
           Resolution|---            |FIXED
   Target Milestone  |---            |12.0
           Status    |UNCONFIRMED    |RESOLVED

--- Comment #7 from Roger Sayle  ---
Fixed on mainline.

[Bug rtl-optimization/46235] inefficient bittest code generation

2021-06-16 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #6 from CVS Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:3155d51bfd1de8b6c4645dcb2292248a8d7cc3c9

commit r12-1525-g3155d51bfd1de8b6c4645dcb2292248a8d7cc3c9
Author: Roger Sayle 
Date:   Wed Jun 16 09:56:09 2021 +0100

[PATCH] PR rtl-optimization/46235: Improved use of bt for bit tests on x86_64.

This patch tackles PR46235 to improve the code generated for bit tests
on x86_64 by making more use of the bt instruction.  Currently, GCC emits
bt instructions when they are followed by conditional jumps (thanks to
Uros' splitters).  This patch adds splitters in i386.md to catch the cases
where bt is followed by a conditional move (as in the original report), or
by a setc/setnc (as in comment 5 of the Bugzilla PR).

With this patch, the function in the original PR

int foo(int a, int x, int y) {
    if (a & (1 << x))
        return a;
    return 1;
}

which with -O2 on mainline generates:
foo:    movl    %edi, %eax
        movl    %esi, %ecx
        sarl    %cl, %eax
        testb   $1, %al
        movl    $1, %eax
        cmovne  %edi, %eax
        ret

now generates:
foo:    btl     %esi, %edi
        movl    $1, %eax
        cmovc   %edi, %eax
        ret

Likewise, IsBitSet1 and IsBitSet2 (from comment 5)

bool IsBitSet1(unsigned char byte, int index) {
    return (byte & (1 << index)) != 0;
}

bool IsBitSet2(unsigned char byte, int index) {
    return (byte >> index) & 1;
}

Before:
        movzbl  %dil, %eax
        movl    %esi, %ecx
        sarl    %cl, %eax
        andl    $1, %eax
        ret

After:
        movzbl  %dil, %edi
        btl     %esi, %edi
        setc    %al
        ret

According to Agner Fog, SAR/SHR r,cl takes 2 cycles on Skylake,
whereas BT r,r takes only one, so the performance improvements on
recent hardware may be more significant than implied by just
the reduced number of instructions.  I've avoided transforming cases
(such as btsi_setcsi) where using bt sequences may not be a clear
win (over sarq/andl).

2021-06-15  Roger Sayle  

gcc/ChangeLog
PR rtl-optimization/46235
* config/i386/i386.md: New define_split for bt followed by cmov.
(*bt_setcqi): New define_insn_and_split for bt followed by setc.
(*bt_setncqi): New define_insn_and_split for bt then setnc.
(*bt_setnc): New define_insn_and_split for bt followed by setnc
with zero extension.

gcc/testsuite/ChangeLog
PR rtl-optimization/46235
* gcc.target/i386/bt-5.c: New test.
* gcc.target/i386/bt-6.c: New test.
* gcc.target/i386/bt-7.c: New test.
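
For readers who want to reproduce the before/after comparison above, the three
functions discussed in the commit message can be collected into one translation
unit.  A minimal sketch, assuming a hypothetical file name bittest.c, compiled
with -O2 on x86_64; the exact assembly depends on the compiler version:

/* bittest.c - minimal sketch collecting the examples quoted above.
   Compile with:  gcc -O2 -S bittest.c   (x86_64)
   With the new splitters, foo is expected to use btl/cmovc and the
   IsBitSet functions btl/setc; older compilers shift and test instead.  */
#include <stdbool.h>

int foo(int a, int x, int y)
{
    if (a & (1 << x))
        return a;
    return 1;
}

bool IsBitSet1(unsigned char byte, int index)
{
    return (byte & (1 << index)) != 0;
}

bool IsBitSet2(unsigned char byte, int index)
{
    return (byte >> index) & 1;
}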

[Bug rtl-optimization/46235] inefficient bittest code generation

2014-06-03 Thread chris.a.ferguson at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #5 from chris.a.ferguson at gmail dot com ---
This optimization opportunity is still being missed as of GCC 4.9.

Test cases:

bool IsBitSet1(unsigned char byte, int index)
{
    return (byte & (1 << index)) != 0;
}

bool IsBitSet2(unsigned char byte, int index)
{
    return (byte >> index) & 1;
}

From GCC 4.9:

IsBitSet1(unsigned char, int):
        mov     ecx, esi
        mov     eax, 1
        movzx   edi, dil
        sal     eax, cl
        test    eax, edi
        setne   al
        ret

IsBitSet2(unsigned char, int):
        movzx   eax, dil
        mov     ecx, esi
        sar     eax, cl
        and     eax, 1
        ret


From Clang 3.3:

IsBitSet1(unsigned char, int):
        btl     %esi, %edi
        setb    %al
        ret

IsBitSet2(unsigned char, int):
        btl     %esi, %edi
        setb    %al
        ret


[Bug rtl-optimization/46235] inefficient bittest code generation

2011-01-28 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #2 from Tony Poppleton tony.poppleton at gmail dot com 2011-01-28 
16:55:48 UTC ---
Based on Richard's comment, I tried a modified version of the code, replacing
the (1 << x) with just (16).

This shows that GCC (4.6 & 4.5.2) does perform an optimization similar to llvm,
and uses the testb instruction:
        movl    %edi, %eax
        movl    $1, %edx
        testb   $16, %al
        cmove   %edx, %eax
        ret
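
For reference, a minimal sketch of the modified test case being described,
assuming the variable shift was simply replaced by the constant 16 (i.e.
1 << 4); the exact source used is not shown in the comment:

/* Hypothetical reconstruction: with the shift amount known at compile time,
   the mask (1 << 4) folds to 16 and GCC tests it directly with testb
   instead of shifting at run time.  */
int foo_const(int a)
{
    if (a & 16)            /* i.e. a & (1 << 4) */
        return a;
    return 1;
}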

Therefore, perhaps it would be beneficial not to convert from a & (n << x) to
(a >> x) & n in the special case where the value n is 1 (or potentially a
power of 2)?

Incidentally, the above code could have been optimized further to remove the
usage of edx entirely (will make a separate PR about that).


[Bug rtl-optimization/46235] inefficient bittest code generation

2011-01-28 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #3 from Tony Poppleton tony.poppleton at gmail dot com 2011-01-28 
17:02:50 UTC ---
Actually, what I said above isn't correct: had it compiled down to bt $4, %al,
then it would make sense to do that special case, but as it used testb the
result is inconclusive.


[Bug rtl-optimization/46235] inefficient bittest code generation

2011-01-28 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #4 from Tony Poppleton tony.poppleton at gmail dot com 2011-01-28 
18:08:15 UTC ---
As a quick test, I commented out the block with the following comment in
fold-const.c:
  /* If this is an EQ or NE comparison with zero and ARG0 is
     (1 << foo) & bar, convert it to (bar >> foo) & 1.  Both require
 two operations, but the latter can be done in one less insn
 on machines that have only two-operand insns or on which a
 constant cannot be the first operand.  */

This produces the following asm code:
        movl    $1, %edx
        movl    %edi, %eax
        movl    %esi, %ecx
        movl    %edx, %edi
        sall    %cl, %edi
        testl   %eax, %edi
        cmove   %edx, %eax
        ret
(using modified GCC 4.6.0 20110122)

So whilst I was hoping for an easy quick fix, it appears that the optimization
required to convert this into a btl test isn't performed later in the
compilation.

Incidentally, from looking at http://gmplib.org/~tege/x86-timing.pdf, it
appears that bt is slow on the P4 architecture (8 cycles, if I am reading it
correctly, which sounds slow), so the llvm code in the bug description isn't
necessarily an optimization on that arch.  Newer chips would probably still
benefit, though.


[Bug rtl-optimization/46235] inefficient bittest code generation

2010-10-30 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #1 from Richard Guenther rguenth at gcc dot gnu.org 2010-10-30 
18:21:48 UTC ---
We canonicalize a & (1 << x) to (a >> x) & 1 very early.
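
To illustrate the canonicalization being described, the two source forms below
should be folded into the same internal form very early; a sketch, assuming the
rewrite can be observed in an early tree dump such as -fdump-tree-original:

/* Both functions spell the same bit test.  Per this comment (and the
   fold-const.c block quoted in comment #4), the first form is rewritten
   into the second one very early, so later passes only ever see
   (a >> x) & 1.  */
int test_shift_mask(int a, int x)
{
    return (a & (1 << x)) != 0;    /* canonicalized to ((a >> x) & 1) != 0 */
}

int test_mask_shift(int a, int x)
{
    return (a >> x) & 1;           /* already in the canonical form */
}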