[Bug tree-optimization/92080] New: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

peter at cordes dot ca Sun, 13 Oct 2019 07:01:56 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080


            Bug ID: 92080
           Summary: Missed CSE of _mm512_set1_epi8(c) with
                    _mm256_set1_epi8(c)
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

As a workaround for PR 82887 some code (e.g. a memset) uses

__m512i zmm = _mm512_set1_epi8((char)c);
__m256i ymm = _mm256_set1_epi8((char)c);

instead of 

  ymm = _mm512_castsi512_si256(zmm);

(found in the persistent-memory library
https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193
)

Obviously we'd like to CSE that instead of actually broadcasting twice.  MCVE:

#include <immintrin.h>

__m512i sinkz;
__m256i sinky;
void foo(char c) {
    sinkz = _mm512_set1_epi8(c);
    sinky = _mm256_set1_epi8(c);
}

https://godbolt.org/z/CeXhi8  g++ (Compiler-Explorer-Build) 10.0.0 20191012

# g++ -O3 -march=skylake-avx512  (AVX512BW + AVX512VL are the relevant ones)
foo(char):
        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vpbroadcastb    %edi, %ymm0          # wasted insn
        vmovdqa64       %ymm0, sinky(%rip)   # wasted EVEX prefix
        vzeroupper
        ret

Without AVX512VL it wastes even more instructions (vmovd + AVX2 vpbroadcastb
xmm,ymm), even though AVX512BW vpbroadcastb zmm already sets the YMM register.
(There are no CPUs with AVX512BW but not AVX512VL; if people compile that way
it's their own fault.  But this might be relevant for set1_epi32() on KNL.)
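A possible test case for the set1_epi32() variant, as a rough sketch only: the
names foo32, sinkz32, sinky32 and low_half_matches are made up here, and the
target attribute stands in for a command-line option such as -mavx512f so the
file builds without extra flags.  It makes no claim about what any particular
GCC version emits; it only shows that the YMM value is the low half of the ZMM
value, which is what makes the second broadcast redundant.

```c
#include <immintrin.h>
#include <string.h>

__m512i sinkz32;
__m256i sinky32;

/* Hypothetical 32-bit-element analogue of foo() from the report.
   Only AVX512F is needed for the ZMM broadcast (no BW / VL). */
__attribute__((target("avx512f")))
void foo32(int x) {
    sinkz32 = _mm512_set1_epi32(x);
    sinky32 = _mm256_set1_epi32(x);   /* same bits as the low half of sinkz32 */
}

/* Returns 1 if the YMM result is bit-identical to the low 256 bits of
   the ZMM result (x86 is little-endian, so those are the first 32 bytes). */
int low_half_matches(void) {
    return memcmp(&sinkz32, &sinky32, sizeof sinky32) == 0;
}
```

Call foo32() only on a CPU that has AVX512F (e.g. after checking
__builtin_cpu_supports("avx512f")), then look at the asm for a second
vpbroadcastd.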

Clang finds this optimization and uses the shorter VEX-encoded vmovdqa for the
YMM store, saving another 2 bytes of code size:

        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vmovdqa         %ymm0, sinky(%rip)
        vzeroupper
        ret
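
In source terms, clang's output corresponds to doing the CSE by hand.  A
minimal sketch, assuming the target attribute is used in place of
-march=skylake-avx512; foo_cse and sinks_hold are names invented for this
illustration.  Note that the cast is exactly the form the PR 82887 workaround
avoids, so GCC versions affected by that PR may not handle it.

```c
#include <immintrin.h>
#include <string.h>

__m512i sinkz;
__m256i sinky;

/* Hypothetical hand-CSE'd variant of foo(): one broadcast, and the YMM
   value is taken as the low 256 bits of the ZMM value. */
__attribute__((target("avx512bw,avx512vl")))
void foo_cse(char c) {
    __m512i zmm = _mm512_set1_epi8(c);      /* single vpbroadcastb */
    sinkz = zmm;
    sinky = _mm512_castsi512_si256(zmm);    /* a cast, not a second broadcast */
}

/* Returns 1 if every byte of both sinks equals c. */
int sinks_hold(char c) {
    unsigned char expect[64];
    memset(expect, (unsigned char)c, sizeof expect);
    return memcmp(&sinkz, expect, 64) == 0
        && memcmp(&sinky, expect, 32) == 0;
}
```

foo_cse() must only be called on a CPU with AVX512BW and AVX512VL; sinks_hold()
is plain C and is just there to confirm both stores got the broadcast value.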
