https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68923

            Bug ID: 68923
           Summary: SSE/AVX movq load (_mm_cvtsi64_si128) not being folded
                    into pmovzx
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Context and background:
http://stackoverflow.com/questions/34279513/loading-8-chars-from-memory-into-an-m256-variable-as-packed-single-precision-f

Using intrinsics, I can't find a way to get gcc to emit

    VPMOVZXBD   (%rsi), %ymm0   ; 64b load
    VCVTDQ2PS   %ymm0,  %ymm0

without using _mm_loadu_si128, which compiles to an actual (potentially
out-of-bounds) 128b load at -O0.  (Not counting the evil workaround of using
#ifndef __OPTIMIZE__ to do it two different ways, of course; a sketch of that
follows.)
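
For reference, a minimal sketch of that workaround (the load_low64 wrapper
name is mine; same #includes and strict-aliasing caveats as the example
further down):

static inline __m128i load_low64(const uint8_t *p)
{
#ifdef __OPTIMIZE__
    // Optimized builds are expected to fold this into a 64b memory operand
    // for vpmovzxbd, so the bytes past p+8 are never actually touched.
    return _mm_loadu_si128( (const __m128i*)p );
#else
    // At -O0, do a true 64b load so we never read past the end of p.
    return _mm_cvtsi64_si128( *(const uint64_t*)p );
#endif
}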


Since there is no intrinsic for PMOVSX / PMOVZX as a load from a narrower
memory location, the only way I can see to write this correctly with
intrinsics involves _mm_cvtsi64_si128 (MOVQ), which I don't even want the
compiler to emit.  clang 3.6 and ICC 13 compile this to the optimal sequence,
still folding the load into VPMOVZXBD, but gcc doesn't.


#include <immintrin.h>
#include <stdint.h>
#define USE_MOVQ
__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef  USE_MOVQ  // compiles to an actual movq then pmovzx xmm,xmm with gcc -O3
    __m128i small_load = _mm_cvtsi64_si128( *(uint64_t*)p );
#else  // loadu compiles to a 128b load with gcc -O0, potentially segfaulting
    __m128i small_load = _mm_loadu_si128( (__m128i*)p );
#endif

    __m256i intvec = _mm256_cvtepu8_epi32( small_load );
    return _mm256_cvtepi32_ps(intvec);
}



Problem 1: the MOVQ load isn't folded.  g++ -O3 -march=haswell (gcc 5.3.0 on
godbolt) emits:

load_bytes_to_m256(unsigned char*):
        vmovq   (%rdi), %xmm0
        vpmovzxbd       %xmm0, %ymm0
        vcvtdq2ps       %ymm0, %ymm0
        ret


Problem 2: gcc and clang don't even provide that MOVQ intrinsic
(_mm_cvtsi64_si128) in 32-bit mode.

(Split into a separate bug, since it's entirely separate from the
missed-optimization issue.)
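
For what it's worth, a sketch of a way to write the 64b load that is at least
available in both modes: the SSE2 _mm_loadl_epi64 (MOVQ load) intrinsic.  The
wrapper name is mine, and I haven't checked whether gcc folds this form into
VPMOVZXBD either:

static inline __m128i load_low64_any_mode(const uint8_t *p)
{
    // MOVQ load: reads exactly 8 bytes and zeroes the upper half of the
    // xmm register; available even in 32-bit mode.
    return _mm_loadl_epi64( (const __m128i*)p );
}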
