https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837
          Bug ID: 80837
         Summary: [7.1.0 regression] x86 accessing a member of a 16-byte
                  atomic object generates terrible code: splitting/merging
                  the bytes
         Product: gcc
         Version: 7.1.0
          Status: UNCONFIRMED
        Keywords: missed-optimization
        Severity: normal
        Priority: P3
       Component: target
        Assignee: unassigned at gcc dot gnu.org
        Reporter: peter at cordes dot ca
Target Milestone: ---
          Target: x86_64-*-*, i?86-*-*

This code compiles to much worse asm on gcc7.1.0 than on gcc6 or gcc8, doing an insane split/merge of the separate bytes of the atomic load result.  Sorry, I can't easily check whether it's fixed in the latest gcc7, and I don't know what to search for to check for an existing report.

#include <atomic>
#include <stdint.h>

struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;          // non-atomic pointer-to-atomic
    uintptr_t count;
    //int foo;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
    return p->load(std::memory_order_acquire).ptr;
}

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 re: using a narrow load of just the requested member.  See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for the store-forwarding stalls from a bad int->xmm strategy when compiling this with -m32.

With gcc7.1.0 -mx32 -O3 -march=nehalem (https://godbolt.org/g/jeCwsP):

    movq    (%edi), %rcx        # 8-byte load of the whole object
    xorl    %edx, %edx          # and then split/merge the bytes
    movzbl  %ch, %eax
    movb    %cl, %dl
    movq    %rcx, %rsi
    movb    %al, %dh            # put the 2nd byte back where it belongs, after getting it
                                # into the low byte of eax for no reason.
                                # This could have been movb %ch, %dh.
    andl    $16711680, %esi
    andl    $4278190080, %ecx
    movzwl  %dx, %eax           # partial-register stall on reading %dx; 3-cycle stall on
                                # core2/nehalem to insert a merging uop (and even compiling
                                # with -march=core2 doesn't change the strategy to use an
                                # extra OR insn)
    orq     %rsi, %rax
    orq     %rcx, %rax          # why 64-bit operand-size?
    ret

-m32 uses basically the same code, but with a 32-bit OR at the end.  It does the 8-byte atomic load using SSE or x87.  (And then stores/reloads instead of using movq %xmm0, %eax, but that's a separate missed-optimization.)

-m64 (without -msse4) uses a lot of movabs to set up masks for the high bytes, to split/merge all 8.  (The 16B atomic load is done with a call to libatomic instead of inlining cmpxchg16b, since this patch: https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html.)

---

-msse4 results in a different strategy for a useless 8-byte split/merge: the atomic-load result in rdx:rax is stored to the stack, then reloaded into xmm1, resulting in a store-forwarding stall.  Looks like the 64-bit version of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833: bad int->xmm strategies.

Then the insane byte-at-a-time copy is from the low 8 bytes of %xmm1 into %xmm0, using pextrb and pinsrb (I'm not going to paste in the whole code); see https://godbolt.org/g/SCUIeu.  It initializes %xmm0 with

    movq    $0, 16(%rsp)
    movq    $0, 24(%rsp)
    movb    %r10b, 16(%rsp)     # r10 holds the first byte of the source data
    movdqa  16(%rsp), %xmm0

which is obviously horrible compared to movd %r10d, %xmm0.  Or movzbl %al,%eax / movd %eax,%xmm0 instead of messing around with %r10 at all.  (%r10d was written with movzbl (%rsp), %r10d, reloading from the store of RAX.)

IDK how gcc managed to produce code with so many of the small steps done in horribly sub-optimal ways, never mind the fact that it's all useless in the first place.
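For convenience, here's a standalone harness built around the same testcase (just a sketch; the main() and the printf output are only illustrative and not part of the repro).  It makes it easy to look at the generated asm for load_nounion at different -m settings and to see at runtime whether the atomic reports itself as lock-free:

#include <atomic>
#include <cstdio>
#include <stdint.h>

struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;          // non-atomic pointer-to-atomic
    uintptr_t count;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
    return p->load(std::memory_order_acquire).ptr;
}

int main() {
    std::atomic<counted_ptr> p{ counted_ptr{nullptr, 0} };
    // Note: is_lock_free() can still report true at runtime even when the load
    // above compiles to a libatomic call (libatomic may use cmpxchg16b
    // internally); it's printed here only for reference.
    std::printf("sizeof=%zu align=%zu lock_free=%d\n",
                sizeof(counted_ptr), alignof(counted_ptr),
                (int)p.is_lock_free());
    return load_nounion(&p) != nullptr;
}

Compile with e.g. g++ -O3 -march=nehalem (plus -m64 / -m32 / -mx32 and optionally -msse4) and inspect the asm for load_nounion to see the split/merge described above.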