https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

            Bug ID: 80837
           Summary: [7.1.0 regression] x86 accessing a member of a 16-byte
                    atomic object generates terrible code:
                    splitting/merging the bytes
           Product: gcc
           Version: 7.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

This code compiles to much worse asm on gcc7.1.0 than on gcc6 or gcc8, doing an
insane split/merge of the separate bytes of the atomic load result.  Sorry, I
can't easily check whether it's fixed in the latest gcc7, and I don't know what
to search for to check for an existing report.

#include <atomic>
#include <stdint.h>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;    // non-atomic pointer-to-atomic
    uintptr_t count;
    //int foo;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 re: using a narrow load
of just the requested member.
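
For reference, the narrow load PR80835 asks for is roughly what this
hand-written version does (a sketch only, not strictly conforming C++ since it
type-puns the atomic object; it assumes .ptr sits at offset 0 and that an
8-byte acquire load of it is atomic and sufficient on this target):

// Hand-rolled narrow load: grab just the .ptr member with an 8-byte
// acquire load instead of loading the whole 16-byte object.
node *load_narrow_by_hand(std::atomic<counted_ptr> *p) {
  node **first = reinterpret_cast<node **>(p);   // type-pun: illustration only
  return __atomic_load_n(first, __ATOMIC_ACQUIRE);
}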

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for the store-forwarding
stalls from a bad int->xmm strategy when compiling this with -m32.

With gcc7.1.0 -mx32 -O3 -march=nehalem (https://godbolt.org/g/jeCwsP)


        movq    (%edi), %rcx   # 8-byte load of the whole object
        xorl    %edx, %edx     # and then split/merge the bytes
        movzbl  %ch, %eax
        movb    %cl, %dl
        movq    %rcx, %rsi
        movb    %al, %dh       # put the 2nd byte back where it belongs, after
                               # getting it into the low byte of eax for no
                               # reason.  This could have been movb %ch, %dh.
        andl    $16711680, %esi
        andl    $4278190080, %ecx
        movzwl  %dx, %eax      # partial-register stall on reading %dx; a
                               # 3-cycle stall on core2/nehalem to insert a
                               # merging uop (and even compiling with
                               # -march=core2 doesn't change the strategy to
                               # use an extra OR insn)
        orq     %rsi, %rax
        orq     %rcx, %rax     # why 64-bit operand-size?
        ret

-m32 uses basically the same code, but with a 32-bit OR at the end.  It does
the 8-byte atomic load using SSE or x87.  (And then it stores/reloads instead
of using movd %xmm0, %eax, but that's a separate missed-optimization.)

-m64 (without -msse4) uses a lot of movabs to set up masks for the high bytes,
splitting/merging all 8 bytes.  (The 16B atomic load is done with a call to
libatomic instead of inlining cmpxchg16b, since this patch:
https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html).
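
For what it's worth, the 16-byte load itself boils down to the generic
__atomic_load builtin, which becomes the out-of-line libatomic call
(__atomic_load_16) when cmpxchg16b isn't inlined.  A minimal sketch of the same
load without std::atomic, using the struct from above (function name is mine):

// 16-byte acquire load via the underlying builtin; on -m64 this now
// becomes a call into libatomic rather than an inlined lock cmpxchg16b.
counted_ptr load_via_builtin(counted_ptr *p) {
  counted_ptr result;
  __atomic_load(p, &result, __ATOMIC_ACQUIRE);
  return result;
}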

---

-msse4 results in a different strategy for a useless 8-byte split/merge:

The atomic-load result in rdx:rax is stored to the stack, then reloaded into
xmm1, resulting in a store-forwarding stall.  Looks like the 64-bit version of
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833: bad int->xmm strategies.
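
An intrinsics-level sketch of the stall-free alternative (helper name is mine;
it assumes SSE4.1 for pinsrq, which -msse4 implies): move the two GPR halves
straight into the vector register instead of bouncing them through the stack:

#include <immintrin.h>

// rdx:rax -> xmm without a store/reload: movq for the low half,
// pinsrq for the high half.
__m128i pair_to_xmm(long long lo, long long hi) {
  __m128i v = _mm_cvtsi64_si128(lo);     // movq: GPR -> XMM
  return _mm_insert_epi64(v, hi, 1);     // pinsrq $1: high half
}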


Then it does the insane byte-at-a-time copy from the low 8 bytes of %xmm1 into
%xmm0, using pextrb and pinsrb (I'm not going to paste in the whole thing; see
https://godbolt.org/g/SCUIeu).

It initializes %xmm0 with
  movq    $0, 16(%rsp)
  movq    $0, 24(%rsp)
  movb    %r10b, 16(%rsp)  # r10 holds the first byte of the source data
  movdqa  16(%rsp), %xmm0
which is obviously horrible compared to movd %r10d, %xmm0.  Or movzbl %al,%eax
/ movd %eax,%xmm0 instead of messing around with %r10 at all.  (%r10d was
written with movzbl (%rsp), %r10d, reloading from the store of RAX.)
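
Intrinsics-level sketch of that suggested sequence (helper name is mine):
zero-extend the byte in a GPR and movd it straight into the vector register,
with no zeroed stack slots and no store-forwarding stall:

#include <immintrin.h>

// Byte -> xmm directly: compiles to roughly movzbl + movd instead of
// two zeroing stores, a byte store, and a movdqa reload.
__m128i byte_to_xmm(unsigned char b) {
  return _mm_cvtsi32_si128(b);
}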

IDK how gcc managed to produce code with so many of the small steps done in
horribly sub-optimal ways, nevermind the fact that it's all useless in the
first place.
