https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
This seems to be partially fixed in gcc8.0:

#include <stdint.h>
uint64_t btr_variable(uint64_t x, unsigned bit) {
    //bit = 53;   // produces btr in older gcc, too.
    return x & ~(1ULL << bit);
}

        movq    %rdi, %rax
        btrq    %rsi, %rax
        ret

vs. gcc7.2 -O3 -mtune=haswell:

        movl    %esi, %ecx
        movq    $-2, %rdx
        rolq    %cl, %rdx
        movq    %rdx, %rax   # this is dumb, should have put the mask in rax in the first place
        andq    %rdi, %rax
        ret

Or with bit=53:

        movabsq $-9007199254740993, %rax
        andq    %rdi, %rax
        ret

btr $53, %rax only has 2 per clock throughput instead of 4 per clock for AND,
but a 10-byte mov instruction to set up the constant is almost never going to
be worth it for -mtune=haswell.  It takes up extra slots in the uop cache.

---

The inner loop from Matthias's attached program *really* confuses gcc, so
badly that it never gets to the btr pattern, apparently.

unsigned long cfunc_one(unsigned long tmp)
{
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 3) {
        tmp &= ~(1UL << bit);
    }
    return tmp;
}

        movq    %rdi, %rax
        xorl    %ecx, %ecx
        movl    $1, %esi
.L5:
        movq    %rsi, %rdx   # start with 1UL every time
        salq    %cl, %rdx
        addq    $3, %rcx
        notq    %rdx         # what happened to rotating -2?
        andq    %rdx, %rax
        cmpq    $66, %rcx
        jne     .L5
        ret

This is obviously horrible, but the right answer isn't btr in a loop, it's
what clang does:

        movabsq $7905747460161236406, %rax   # imm = 0x6DB6DB6DB6DB6DB6  every third bit unset
        andq    %rdi, %rax
        retq

gcc does spot this with `bit += 7`, I guess because with fewer iterations it
decides to try fully unrolling, and then it can optimize.  (A quick program to
verify the folded masks for both strides is at the end of this comment.)

With a constant shift count and an inline function call, gcc manages to get
really confused auto-vectorizing the loop:

uint64_t btr64(uint64_t x, unsigned bit) {
    bit = 53;
    return x & ~(1ULL << bit);
}

unsigned long cfunc_one(unsigned long tmp)
{
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 7) {
        //tmp &= ~(1UL << bit);
        tmp = btr64(tmp, bit);
    }
    return tmp;
}

        movdqa  .LC0(%rip), %xmm0   # constant with both halves the same
        movdqa  %xmm0, %xmm1
        psrldq  $8, %xmm1
        pand    %xmm1, %xmm0
        movq    %xmm0, %rax
        # The above is equivalent to   mov .LC0(%rip), %rax
        andq    %rdi, %rax
        ret

(In reply to Richard Biener from comment #1)
> Can you provide a testcase that can be compiled please?
>
> Cut&pasting from i386.md:
>
> ;; %%% bts, btr, btc, bt.
> ;; In general these instructions are *slow* when applied to memory,
> ;; since they enforce atomic operation.

This error is fixed in the current version:
https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/i386/i386.md

They're actually slow because of crazy-CISC semantics, and they aren't atomic
without a lock prefix.  btr %rax, (%rdi) uses %rax as a bit index into memory
relative to %rdi, so the actual byte or dword or qword eventually accessed is
*not* determined by the addressing mode alone.  It's micro-coded as several
uops.  (A rough C model of this is at the end of this comment.)

> When applied to registers,
> ;; it depends on the cpu implementation.  They're never faster than
> ;; the corresponding and/ior/xor operations, so with 32-bit there's
> ;; no point.  But in 64-bit, we can't hold the relevant immediates
> ;; within the instruction itself, so operating on bits in the high
> ;; 32-bits of a register becomes easier.
This section is talking about using btr with an immediate operand, like
btr $53, %rax, because and $imm64, %rax doesn't exist; only
and $sign_extended_imm32, %rax does.  (A minimal pair demonstrating the
difference is at the end of this comment.)

Does `(set_attr "type" "alu1")` mean gcc thinks it only has 1 per clock
throughput?  Or that it competes with other "alu1" instructions?

On Intel since Sandybridge, bt/btr/bts/btc reg,reg or imm,reg is 2 per clock.
It's 1 per clock on Bulldozer-family and Jaguar, and 2 per clock on Ryzen.
On Silvermont / KNL, they're 1 per clock, occupying both integer ports.
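---

To verify what those loops should fold down to, here's a quick check program
(my sketch, not part of the original testcase; mask_for_stride is a name I
made up, and like the testcase it assumes 64-bit unsigned long).  It runs the
same loop as cfunc_one but starts from all-ones, so the return value is the
single AND mask the whole loop is equivalent to:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Same loop as cfunc_one, starting from all-ones so the return value
   is the combined AND mask the compiler could fold the loop to. */
static uint64_t mask_for_stride(unsigned stride) {
    uint64_t tmp = ~0ULL;
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += stride)
        tmp &= ~(1UL << bit);
    return tmp;
}

int main(void) {
    /* stride 3 prints 0x6db6db6db6db6db6, matching clang's movabsq above */
    printf("stride 3: 0x%016" PRIx64 "\n", mask_for_stride(3));
    printf("stride 7: 0x%016" PRIx64 "\n", mask_for_stride(7));
    return 0;
}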
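For the memory-destination case, here's a rough C model of what
btr %rax, (%rdi) has to do with 64-bit operand size (my sketch of the
semantics as I understand them, not the manual's pseudocode):

#include <stdint.h>

/* btr %rax, (%rdi): the register is a *signed* bit offset relative to
   the qword at (%rdi), so the qword actually loaded and stored depends
   on the register value, not just on the addressing mode. */
void btr_mem_model(uint64_t *base, int64_t bitoff) {
    uint64_t *word = base + (bitoff >> 6);  /* arithmetic shift: floor division, offset can be negative */
    *word &= ~(1ULL << (bitoff & 63));      /* clear that bit within the selected qword */
}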
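And a minimal pair for the immediate-width point (function names are just for
illustration):

#include <stdint.h>

/* ~(1ULL << 20) = 0xffffffffffefffff sign-extends from 32 bits, so this
   can be a single  andq $-1048577, %rax. */
uint64_t clear_bit20(uint64_t x) { return x & ~(1ULL << 20); }

/* ~(1ULL << 53) = 0xffdfffffffffffff doesn't fit in a sign-extended
   imm32, so the mask needs a 10-byte movabsq, or btr $53, %rax. */
uint64_t clear_bit53(uint64_t x) { return x & ~(1ULL << 53); }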