https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82139
            Bug ID: 82139
           Summary: unnecessary movapd with _mm_castsi128_pd to use BLENDPD on
                    __m128i results
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
#include <stdint.h>

// stripped down from a real function that did something more useful
void foo(uint64_t blocks[]) {
    for (int i = 0 ; i<10240 ; i+=2) {
        __m128i v = _mm_loadu_si128((__m128i*)&blocks[i]);
        __m128i t1 = _mm_add_epi32(v, _mm_set1_epi32(1));
        __m128i t2 = _mm_add_epi32(v, _mm_set1_epi32(-1));
        __m128d blend = _mm_blend_pd(_mm_castsi128_pd(t1), _mm_castsi128_pd(t2), 2);
        // is this even aliasing-safe?  Could cast back to __m128i
        _mm_storeu_pd((double*)(__m128d*)&blocks[i], blend);
    }
}

https://godbolt.org/g/im1kcc for source and gcc-trunk asm output (and the
slightly larger version of this function that I simplified).

blendpd/blendps have better throughput than pblendw on Intel CPUs, so I played
with that in this function I was looking at. gcc4.8 and later waste a MOVAPD
for no reason instead of clobbering one of the PADDD results with the blend.

(The larger version of this function, pairs_u64_sse2 in the godbolt link,
avoids the extra MOVAPD with gcc4.9.4 and earlier, but not in foo(). So maybe
it's just by chance, or maybe 4.8 changed something.) Anyway, still present in
7.2 and 8.0-trunk, and with -O2 or -O3.

(GCC-Explorer-Build) 8.0.0 20170907 -xc -std=gnu99 -O3 -Wall -msse4 -mno-avx

foo:
        pcmpeqd %xmm2, %xmm2
        leaq    81920(%rdi), %rax
        movdqa  .LC0(%rip), %xmm3
.L6:
        movdqa  %xmm3, %xmm1
        addq    $16, %rdi
        movdqu  -16(%rdi), %xmm0
        paddd   %xmm0, %xmm1
        movapd  %xmm1, %xmm4
        paddd   %xmm2, %xmm0
        blendpd $2, %xmm0, %xmm4
        movups  %xmm4, -16(%rdi)
        cmpq    %rdi, %rax
        jne     .L6
        rep ret

Notice that BLENDPD's operands aren't the two output registers from the PADDD
instructions.
Different versions/options (like -mtune=skylake) put the extra MOVAPD between
the PADDD instructions, or right before BLENDPD, so don't let it fool you. :P

With the function even simpler (like only one _mm_add_epi32), blending between
the original load result and the add result didn't appear to have an extra
MOVAPD.
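
For anyone checking that a fixed codegen still computes the same thing, here is a rough scalar model of what foo() does to memory. It is only a sketch: foo_scalar and add_epi32_lanes are hypothetical names that are not in the testcase, and the element count is passed as a parameter instead of being hard-coded to 10240. It assumes the usual little-endian x86 layout, where blocks[i] is element 0 of the vector and blocks[i+1] is element 1. With blend immediate 2 (binary 10), element 0 comes from t1 (the +1 result) and element 1 comes from t2 (the -1 result).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the testcase): add k to each 32-bit half of a
   64-bit element with wraparound, which is what PADDD does to that element. */
static uint64_t add_epi32_lanes(uint64_t x, uint32_t k)
{
    uint32_t lo = (uint32_t)x + k;          /* low dword, wraps mod 2^32 */
    uint32_t hi = (uint32_t)(x >> 32) + k;  /* high dword, wraps mod 2^32 */
    return ((uint64_t)hi << 32) | lo;
}

/* Scalar model of foo(): for each pair of qwords, the blend with immediate 2
   keeps element 0 from t1 (v + 1) and takes element 1 from t2 (v - 1). */
static void foo_scalar(uint64_t blocks[], size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        blocks[i]     = add_epi32_lanes(blocks[i], 1u);
        blocks[i + 1] = add_epi32_lanes(blocks[i + 1], 0xFFFFFFFFu); /* -1 */
    }
}
```

Calling foo_scalar(blocks, 10240) on a copy of the input should give the same bytes as foo(blocks), so it can serve as a reference when comparing compiler versions or options.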