https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80819

            Bug ID: 80819
           Summary: [5/6/7/8 regression] Useless store to the stack in
                    _mm_set_epi64x with SSE4 -mno-avx
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
__m128i combine64(long long a, long long b) {
  return _mm_set_epi64x(b,a);
}

gcc5/6/7/8-snapshot with -O3 -msse4 -mtune=haswell emits:

        movq    %rdi, %xmm0
        movq    %rsi, -16(%rsp)   # dead store into the red-zone
        pinsrq  $1, %rsi, %xmm0

Something similar happens with -mtune=generic -msse4: it stores both halves to
memory, but only reloads the low half.  The high half is transferred with
pinsrq, so its store is dead:

        movq    %rdi, -16(%rsp)
        movq    %rsi, -24(%rsp)   # dead store
        movq    -16(%rsp), %xmm0
        pinsrq  $1, %rsi, %xmm0


-mavx avoids the useless store, for tune=generic and tune=haswell.

This is a left-over from the store/reload strategy it uses without SSE4 (which
is worse than movq/movq/punpcklqdq, but that's a separate bug):

        movq    %rsi, -16(%rsp)
        movq    %rdi, %xmm0
        movhps  -16(%rsp), %xmm0
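
For reference, the movq/movq/punpcklqdq alternative mentioned above can be
spelled out with SSE2 intrinsics.  This is only a sketch for x86-64:
combine64_unpack and lanes_equal are hypothetical names that are not part of
this report, and the instruction named in each comment is the intended
mapping, not verified output from any particular gcc version.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <string.h>

/* Hypothetical helper: build the same vector as _mm_set_epi64x(b, a),
   i.e. a in the low 64 bits and b in the high 64 bits, without
   going through memory. */
__m128i combine64_unpack(long long a, long long b) {
  __m128i lo = _mm_cvtsi64_si128(a);   /* intended: movq %rdi, %xmm0 */
  __m128i hi = _mm_cvtsi64_si128(b);   /* intended: movq %rsi, %xmm1 */
  return _mm_unpacklo_epi64(lo, hi);   /* intended: punpcklqdq %xmm1, %xmm0 */
}

/* Hypothetical helper: bytewise comparison of two vectors. */
int lanes_equal(__m128i x, __m128i y) {
  return memcmp(&x, &y, sizeof x) == 0;
}
```

Calling lanes_equal(combine64_unpack(a, b), _mm_set_epi64x(b, a)) is a quick
way to confirm the two forms agree on lane order.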

It's a regression from gcc4.x, where we get the expected good sequence for
-msse4 -mtune=haswell.

        movq    %rdi, %xmm0
        pinsrq  $1, %rsi, %xmm0
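
The expected movq + pinsrq sequence can also be requested explicitly with
SSE4.1 intrinsics.  A minimal sketch for x86-64, assuming gcc or clang (the
target attribute is used so that -msse4 is not needed on the command line);
combine64_pinsrq and vec_equal are hypothetical names that are not part of
this report, and the instruction comments show the intended mapping only.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <string.h>

/* Hypothetical helper: same result as _mm_set_epi64x(b, a), written as
   one register move plus one insert into the high element. */
__attribute__((target("sse4.1")))
__m128i combine64_pinsrq(long long a, long long b) {
  __m128i v = _mm_cvtsi64_si128(a);    /* intended: movq   %rdi, %xmm0 */
  return _mm_insert_epi64(v, b, 1);    /* intended: pinsrq $1, %rsi, %xmm0 */
}

/* Hypothetical helper: bytewise comparison of two vectors. */
int vec_equal(__m128i x, __m128i y) {
  return memcmp(&x, &y, sizeof x) == 0;
}
```

Whether a given compiler version keeps this free of the extra store is
exactly what this report is about, so the asm should be checked per version.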


---------------

It doesn't happen for _mm_set_epi32.  For example:
__m128i combine32(int a, int b, int c, int d) {
  return _mm_set_epi32(d,c,b,a);
}

compiles (with -mtune=haswell -msse4) to code that looks good to me:
        movd    %edx, %xmm1
        movd    %edi, %xmm0
        pinsrd  $1, %ecx, %xmm1
        pinsrd  $1, %esi, %xmm0
        punpcklqdq      %xmm1, %xmm0

clang uses 1 movd and 3x pinsrd, which is 2 bytes shorter and also 7 uops for
port5 on Haswell, but has slightly less ILP.  (On CPUs where pinsrd is 2 uops,
the first one is probably an int->vector uop that can run before the
destination vector is ready.)
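
The two strategies can be written out with intrinsics to make the dependency
chains visible.  A sketch only, assuming gcc or clang on x86 with SSE4.1
available at run time; combine32_two_chains, combine32_one_chain and
vec32_equal are hypothetical names that are not part of this report, and a
compiler is free to emit different instructions for either function.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <string.h>

/* gcc-style: two independent movd+pinsrd chains, merged by punpcklqdq. */
__attribute__((target("sse4.1")))
__m128i combine32_two_chains(int a, int b, int c, int d) {
  __m128i lo = _mm_insert_epi32(_mm_cvtsi32_si128(a), b, 1);  /* {a, b, 0, 0} */
  __m128i hi = _mm_insert_epi32(_mm_cvtsi32_si128(c), d, 1);  /* {c, d, 0, 0} */
  return _mm_unpacklo_epi64(lo, hi);                          /* {a, b, c, d} */
}

/* clang-style: one movd followed by three pinsrd on the same register,
   so each insert depends on the previous one. */
__attribute__((target("sse4.1")))
__m128i combine32_one_chain(int a, int b, int c, int d) {
  __m128i v = _mm_cvtsi32_si128(a);
  v = _mm_insert_epi32(v, b, 1);
  v = _mm_insert_epi32(v, c, 2);
  v = _mm_insert_epi32(v, d, 3);
  return v;
}

/* Hypothetical helper: bytewise comparison of two vectors. */
int vec32_equal(__m128i x, __m128i y) {
  return memcmp(&x, &y, sizeof x) == 0;
}
```

Both functions should produce the same vector as _mm_set_epi32(d, c, b, a);
they differ only in how much of the work can proceed in parallel.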

-mtune=generic still stores/reloads instead of using movd for %edi and %edx,
which is worse for most CPUs.  (Which is a bug, IMO: I'll file a separate bug
for that.)  But it does then use pinsrd with a register source for %ecx and
%esi, instead of a store/reload there.
