[Bug rtl-optimization/46235] inefficient bittest code generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

Bug 46235 depends on bug 104982, which changed state.

Bug 104982 Summary: [12 Regression] FAIL: gcc.target/i386/bt-5.c by r12-7687
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104982

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

Roger Sayle changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com
         Resolution|---                         |FIXED
   Target Milestone|---                         |12.0
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #7 from Roger Sayle ---
Fixed on mainline.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #6 from CVS Commits ---
The master branch has been updated by Roger Sayle:

https://gcc.gnu.org/g:3155d51bfd1de8b6c4645dcb2292248a8d7cc3c9

commit r12-1525-g3155d51bfd1de8b6c4645dcb2292248a8d7cc3c9
Author: Roger Sayle
Date:   Wed Jun 16 09:56:09 2021 +0100

    [PATCH] PR rtl-optimization/46235: Improved use of bt for bit tests on
    x86_64.

    This patch tackles PR46235 to improve the code generated for bit tests
    on x86_64 by making more use of the bt instruction.  Currently, GCC
    emits bt instructions when followed by condition jumps (thanks to
    Uros' splitters).  This patch adds splitters in i386.md to catch the
    cases where bt is followed by a conditional move (as in the original
    report), or by a setc/setnc (as in comment 5 of the Bugzilla PR).

    With this patch, the function in the original PR

    int foo(int a, int x, int y) {
      if (a & (1 << x))
        return a;
      return 1;
    }

    which with -O2 on mainline generates:

    foo:    movl    %edi, %eax
            movl    %esi, %ecx
            sarl    %cl, %eax
            testb   $1, %al
            movl    $1, %eax
            cmovne  %edi, %eax
            ret

    now generates:

    foo:    btl     %esi, %edi
            movl    $1, %eax
            cmovc   %edi, %eax
            ret

    Likewise, IsBitSet1 and IsBitSet2 (from comment 5)

    bool IsBitSet1(unsigned char byte, int index) {
      return (byte & (1 << index)) != 0;
    }

    bool IsBitSet2(unsigned char byte, int index) {
      return (byte >> index) & 1;
    }

    Before:

            movzbl  %dil, %eax
            movl    %esi, %ecx
            sarl    %cl, %eax
            andl    $1, %eax
            ret

    After:

            movzbl  %dil, %edi
            btl     %esi, %edi
            setc    %al
            ret

    According to Agner Fog, SAR/SHR r,cl takes 2 cycles on Skylake, where
    BT r,r takes only one, so the performance improvements on recent
    hardware may be more significant than implied by just the reduced
    number of instructions.  I've avoided transforming cases (such as
    btsi_setcsi) where using bt sequences may not be a clear win (over
    sarq/andl).

    2021-06-15  Roger Sayle

    gcc/ChangeLog
            PR rtl-optimization/46235
            * config/i386/i386.md: New define_split for bt followed by cmov.
            (*bt_setcqi): New define_insn_and_split for bt followed by setc.
            (*bt_setncqi): New define_insn_and_split for bt then setnc.
            (*bt_setnc): New define_insn_and_split for bt followed by setnc
            with zero extension.

    gcc/testsuite/ChangeLog
            PR rtl-optimization/46235
            * gcc.target/i386/bt-5.c: New test.
            * gcc.target/i386/bt-6.c: New test.
            * gcc.target/i386/bt-7.c: New test.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #5 from chris.a.ferguson at gmail dot com ---
This optimization opportunity is still being missed as of GCC 4.9.

Test cases:

bool IsBitSet1(unsigned char byte, int index) {
  return (byte & (1 << index)) != 0;
}

bool IsBitSet2(unsigned char byte, int index) {
  return (byte >> index) & 1;
}

From GCC 4.9:

IsBitSet1(unsigned char, int):
        mov     ecx, esi
        mov     eax, 1
        movzx   edi, dil
        sal     eax, cl
        test    eax, edi
        setne   al
        ret
IsBitSet2(unsigned char, int):
        movzx   eax, dil
        mov     ecx, esi
        sar     eax, cl
        and     eax, 1
        ret

From Clang 3.3:

IsBitSet1(unsigned char, int):
        btl     %esi, %edi
        setb    %al
        ret
IsBitSet2(unsigned char, int):
        btl     %esi, %edi
        setb    %al
        ret
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #2 from Tony Poppleton <tony.poppleton at gmail dot com> 2011-01-28 16:55:48 UTC ---
Based on Richard's comment, I tried a modified version of the code, replacing
the (1 << x) with just (16).  This shows that GCC (4.6 & 4.5.2) does perform an
optimization similar to llvm, and uses the testb instruction:

        movl    %edi, %eax
        movl    $1, %edx
        testb   $16, %al
        cmove   %edx, %eax
        ret

Therefore, perhaps it would be beneficial to not convert from a & (n << x) to
(a >> x) & n, in the special case where the value n is 1 (or a power of 2
potentially)?

Incidentally, the above code could have been optimized further to remove the
usage of edx entirely (will make a separate PR about that)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #3 from Tony Poppleton <tony.poppleton at gmail dot com> 2011-01-28 17:02:50 UTC ---
Actually, what I said above isn't correct - had it compiled down to "bt $4,
%al" then it would make sense to do that special case, but as it used testb it
is inconclusive.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #4 from Tony Poppleton <tony.poppleton at gmail dot com> 2011-01-28 18:08:15 UTC ---
As a quick test, I commented out the block with the following comment in
fold-const.c:

      /* If this is an EQ or NE comparison with zero and ARG0 is
         (1 << foo) & bar, convert it to (bar >> foo) & 1.  Both require
         two operations, but the latter can be done in one less insn
         on machines that have only two-operand insns or on which a
         constant cannot be the first operand.  */

This produces the following asm code:

        movl    $1, %edx
        movl    %edi, %eax
        movl    %esi, %ecx
        movl    %edx, %edi
        sall    %cl, %edi
        testl   %eax, %edi
        cmove   %edx, %eax
        ret

(using modified GCC 4.6.0 20110122)

So whilst I was hoping for an easy quick-fix, it appears that the required
optimization to convert it into a btl test isn't there later on in the
compile.

Incidentally, from looking at http://gmplib.org/~tege/x86-timing.pdf, it
appears that bt is slow on the P4 architecture (8 cycles, if I am reading it
correctly, which sounds slow), so the llvm code in the bug description isn't
necessarily an optimization on this arch.  Newer chips would probably still
benefit though.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-10-30 18:21:48 UTC ---
We canonicalize a & (1 << x) to (a >> x) & 1 very early.