https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87317
Bug ID: 87317
Summary: Missed optimisation: merging VMOVQ with operations that only use the low 8 bytes
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---

Test:

#include <immintrin.h>
int f(void *ptr)
{
    __m128i data = _mm_loadl_epi64((__m128i *)ptr);
    data = _mm_cvtepu8_epi16(data);
    return _mm_cvtsi128_si32(data);
}

GCC generates (-march=haswell or -march=skylake):

        vmovq   (%rdi), %xmm0
        vpmovzxbw %xmm0, %xmm0
        vmovd   %xmm0, %eax
        ret

Note that the VPMOVZXBW instruction reads only the low 8 bytes of its
source, even when the source is a memory operand, so the separate VMOVQ
load can be folded into it. Both Clang and ICC generate:

        vpmovzxbw (%rdi), %xmm0
        vmovd   %xmm0, %eax
        retq

Similarly for:

void f(void *dst, void *ptr)
{
    __m128i data = _mm_cvtsi32_si128(*(int*)ptr);
    data = _mm_cvtepu8_epi32(data);
    _mm_storeu_si128((__m128i*)dst, data);
}

GCC:

        vmovd   (%rsi), %xmm0
        vpmovzxbd %xmm0, %xmm0
        vmovups %xmm0, (%rdi)
        ret

Clang and ICC:

        vpmovzxbd (%rsi), %xmm0
        vmovdqu %xmm0, (%rdi)
        retq

There are other instructions that might benefit from this. AVX-512 memory
instructions where the OpMask is a constant might be candidates too.
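For reference, a portable scalar model of what the first test case computes (a hypothetical sketch, not part of the report; `f_model` is an invented name): VPMOVZXBW zero-extends each of the low 8 source bytes to a 16-bit word, and VMOVD then returns the low 32 bits, i.e. the first two zero-extended bytes. Only 8 bytes of memory are ever read, which is why folding the load into VPMOVZXBW's memory operand is safe:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical scalar model of f() above: zero-extend bytes 0 and 1 of the
 * source into the low and high halves of a 32-bit result. Note the sequence
 * touches only ptr[0..7], matching VPMOVZXBW's 8-byte memory read. */
int f_model(const void *ptr)
{
    uint8_t b[8];
    memcpy(b, ptr, 8);   /* the only memory the instruction sequence reads */
    return (int)((uint32_t)b[0] | ((uint32_t)b[1] << 16));
}

int main(void)
{
    uint8_t src[8] = {0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0};
    printf("0x%08x\n", f_model(src));   /* prints 0x00340012 */
    return 0;
}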