https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87317
Bug ID: 87317
Summary: Missed optimisation: merging VMOVQ with operations that only use the low 8 bytes
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---

Test:

#include <immintrin.h>
int f(void *ptr)
{
    __m128i data = _mm_loadl_epi64((__m128i *)ptr);
    data = _mm_cvtepu8_epi16(data);
    return _mm_cvtsi128_si32(data);
}

GCC generates (-march=haswell or -march=skylake):

        vmovq   (%rdi), %xmm0
        vpmovzxbw %xmm0, %xmm0
        vmovd   %xmm0, %eax
        ret

Note that the VPMOVZXBW instruction reads only the low 8 bytes of its
source, even when the source is a memory operand, so the separate VMOVQ
load can be folded into it. Both Clang and ICC generate:

        vpmovzxbw (%rdi), %xmm0
        vmovd   %xmm0, %eax
        retq

Similarly for:

void f(void *dst, void *ptr)
{
    __m128i data = _mm_cvtsi32_si128(*(int*)ptr);
    data = _mm_cvtepu8_epi32(data);
    _mm_storeu_si128((__m128i*)dst, data);
}

GCC:

        vmovd   (%rsi), %xmm0
        vpmovzxbd %xmm0, %xmm0
        vmovups %xmm0, (%rdi)
        ret

Clang and ICC:

        vpmovzxbd (%rsi), %xmm0
        vmovdqu %xmm0, (%rdi)
        retq

There are other instructions that might benefit from this. AVX-512 memory
instructions where the OpMask is a constant might be candidates too.
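For reference, a portable scalar model of what the first test case computes (a hypothetical sketch, not part of the report; `f_model` is an invented name): VPMOVZXBW zero-extends each of the low 8 source bytes to a 16-bit word, and VMOVD then returns the low 32 bits, i.e. the first two zero-extended bytes. Only 8 bytes of memory are ever read, which is why folding the load into VPMOVZXBW's memory operand is safe:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical scalar model of f() above: zero-extend bytes 0 and 1 of the
 * source into the low and high halves of a 32-bit result. Note the sequence
 * touches only ptr[0..7], matching VPMOVZXBW's 8-byte memory read. */
int f_model(const void *ptr)
{
    uint8_t b[8];
    memcpy(b, ptr, 8);   /* the only memory the instruction sequence reads */
    return (int)((uint32_t)b[0] | ((uint32_t)b[1] << 16));
}

int main(void)
{
    uint8_t src[8] = {0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0};
    printf("0x%08x\n", f_model(src));   /* prints 0x00340012 */
    return 0;
}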