https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80819
            Bug ID: 80819
           Summary: [5/6/7/8 regression] Useless store to the stack in
                    _mm_set_epi64x with SSE4 -mno-avx
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
__m128i combine64(long long a, long long b) {
  return _mm_set_epi64x(b,a);
}

gcc5/6/7/8-snapshot with -O3 -msse4 -mtune=haswell emits:

        movq    %rdi, %xmm0
        movq    %rsi, -16(%rsp)    # dead store into the red-zone
        pinsrq  $1, %rsi, %xmm0

The same thing happens with -mtune=generic -msse4: it stores both halves to
memory, but only reloads the first half. The upper half is transferred with
pinsrq:

        movq    %rdi, -16(%rsp)
        movq    %rsi, -24(%rsp)    # dead store
        movq    -16(%rsp), %xmm0
        pinsrq  $1, %rsi, %xmm0

-mavx avoids the useless store, for tune=generic and tune=haswell.

The dead store is a left-over from the store/reload strategy gcc uses without
SSE4 (which is worse than movq/movq/punpcklqdq, but that's a separate bug):

        movq    %rsi, -16(%rsp)
        movq    %rdi, %xmm0
        movhps  -16(%rsp), %xmm0

It's a regression from gcc4.x, where we get the expected good sequence for
-msse4 -mtune=haswell:

        movq    %rdi, %xmm0
        pinsrq  $1, %rsi, %xmm0

---------------

It doesn't happen for _mm_set_epi32, e.g.

__m128i combine32(int a, int b, int c, int d) {
  return _mm_set_epi32(d,c,b,a);
}

compiles (with -mtune=haswell -msse4) to code that looks good to me:

        movd    %edx, %xmm1
        movd    %edi, %xmm0
        pinsrd  $1, %ecx, %xmm1
        pinsrd  $1, %esi, %xmm0
        punpcklqdq      %xmm1, %xmm0

clang uses 1 movd and 3x pinsrd, which is 2 bytes shorter and also 7 uops for
port5 on Haswell, but has slightly less ILP. (On CPUs where pinsrd is 2 uops,
the first one is probably an int->vector uop that can run before the
destination vector is ready.)

-mtune=generic still stores/reloads instead of using movd for %edi and %edx,
which is worse for most CPUs. (Which is a bug, IMO: I'll file a separate bug
for that.) But it does then use pinsrd with a register source for %ecx and
%esi, instead of a store/reload there.
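
For reference, the movq/movq/punpcklqdq strategy mentioned above can be spelled out with SSE2 intrinsics. This is only a sketch: combine64_unpack, same_bits and check_combine64 are hypothetical names that are not part of the testcase, and the instruction names in the comments are what one would hope to get on x86-64, not a guarantee about what any particular gcc version emits.

```c
#include <assert.h>
#include <immintrin.h>
#include <string.h>

/* Same function as in the report: a lands in the low half, b in the high half. */
static __m128i combine64(long long a, long long b)
{
    return _mm_set_epi64x(b, a);
}

/* Hypothetical alternative: move each half into its own vector register,
   then merge the two low quadwords. Needs only SSE2 (x86-64 baseline). */
static __m128i combine64_unpack(long long a, long long b)
{
    __m128i lo = _mm_cvtsi64_si128(a);   /* hoped for: movq %rdi, %xmm0 */
    __m128i hi = _mm_cvtsi64_si128(b);   /* hoped for: movq %rsi, %xmm1 */
    return _mm_unpacklo_epi64(lo, hi);   /* hoped for: punpcklqdq %xmm1, %xmm0 */
}

/* Returns 1 if the two vectors are bit-for-bit identical. */
static int same_bits(__m128i x, __m128i y)
{
    return memcmp(&x, &y, sizeof x) == 0;
}

/* Hypothetical self-check: both ways of building the vector must agree. */
static void check_combine64(void)
{
    assert(same_bits(combine64(1, 2), combine64_unpack(1, 2)));
    assert(same_bits(combine64(-1, 0x0123456789abcdefLL),
                     combine64_unpack(-1, 0x0123456789abcdefLL)));
}
```

Both functions produce the same vector, so comparing their asm at -O3 with and without -msse4 is a quick way to see whether the dead store shows up only on the _mm_set_epi64x path.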