https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68923
Bug ID: 68923
Summary: SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Context and background:
http://stackoverflow.com/questions/34279513/loading-8-chars-from-memory-into-an-m256-variable-as-packed-single-precision-f

Using intrinsics, I can't find a way to get gcc to emit

    VPMOVZXBD (%rsi), %ymm0     ; 64b load
    VCVTDQ2PS %ymm0, %ymm0

without using _mm_loadu_si128, which will compile to an actual 128b load at -O0 (not counting evil use of #ifndef __OPTIMIZE__ to do it two different ways, of course).

Since there is no intrinsic for PMOVSX / PMOVZX as a load from a narrower memory location, the only way I can see to write this correctly with intrinsics involves _mm_cvtsi64_si128 (MOVQ), which I don't even want the compiler to emit. clang 3.6 and ICC 13 compile this to the optimal sequence, still folding the load into VPMOVZXBD, but gcc doesn't.

    #include <immintrin.h>
    #include <stdint.h>

    #define USE_MOVQ
    __m256 load_bytes_to_m256(uint8_t *p)
    {
    #ifdef USE_MOVQ  // compiles to an actual movq then pmovzx xmm,xmm with gcc -O3
        __m128i small_load = _mm_cvtsi64_si128( *(uint64_t*)p );
    #else  // loadu compiles to a 128b load with gcc -O0, potentially segfaulting
        __m128i small_load = _mm_loadu_si128( (__m128i*)p );
    #endif

        __m256i intvec = _mm256_cvtepu8_epi32( small_load );
        return _mm256_cvtepi32_ps(intvec);
    }

Problem 1: g++ -O3 -march=haswell (gcc 5.3.0 on godbolt) emits a separate movq instead of folding the load into vpmovzxbd:

    load_bytes_to_m256(unsigned char*):
        vmovq     (%rdi), %xmm0
        vpmovzxbd %xmm0, %ymm0
        vcvtdq2ps %ymm0, %ymm0
        ret

Problem 2: gcc and clang don't even provide that movq intrinsic (_mm_cvtsi64_si128) in 32-bit mode. (Split into a separate bug, since it's totally separate from the missed-optimization issue.)