https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
Bug ID: 92080
Summary: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

As a workaround for PR 82887 some code (e.g. a memset) uses

    __m512i zmm = _mm512_set1_epi8((char)c);
    __m256i ymm = _mm256_set1_epi8((char)c);

instead of

    ymm = _mm512_castsi512_si256(zmm);

(found in the persistent-memory library
https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193 )

Obviously we'd like to CSE that instead of actually broadcasting twice.

MVCE:

    #include <immintrin.h>

    __m512i sinkz;
    __m256i sinky;
    void foo(char c) {
        sinkz = _mm512_set1_epi8(c);
        sinky = _mm256_set1_epi8(c);
    }

https://godbolt.org/z/CeXhi8
g++ (Compiler-Explorer-Build) 10.0.0 20191012

# g++ -O3 -march=skylake-avx512  (AVX512BW + AVX512VL are the relevant ones)
foo(char):
        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vpbroadcastb    %edi, %ymm0          # wasted insn
        vmovdqa64       %ymm0, sinky(%rip)   # wasted EVEX prefix
        vzeroupper
        ret

Without AVX512VL it wastes even more instructions (vmovd + AVX2 vpbroadcastb
xmm,ymm), even though AVX512BW vpbroadcastb zmm does set the YMM register.
(There are no CPUs with AVX512BW but not AVX512VL; if people compile that way
it's their own fault. But this might be relevant for set1_epi32() on KNL).

Clang finds this optimization, and uses a shorter vmovdqa for the YMM store,
saving another 2 bytes of code size:

        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vmovdqa         %ymm0, sinky(%rip)
        vzeroupper
        ret
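
For comparison, here is a minimal sketch of the two source forms discussed above side by side: the double-broadcast form from the report, and the hand-written CSE using _mm512_castsi512_si256 (the form the report says is being avoided as a PR 82887 workaround, so it may not be usable on affected compilers). The function names, the AVX512_TARGET macro and the return-code convention are hypothetical, chosen only for illustration. The sketch assumes GCC or Clang on x86, and uses the target function attribute plus __builtin_cpu_supports so that it can be built without -march flags and does not execute AVX-512 code on a CPU that lacks it.

```c
#include <immintrin.h>
#include <string.h>

/* Hypothetical convenience macro: enable AVX-512 for single functions only. */
#define AVX512_TARGET __attribute__((target("avx512f,avx512bw,avx512vl")))

/* Form from the report: two independent broadcasts of the same byte. */
AVX512_TARGET
static void fill_two_broadcasts(char c, unsigned char *z64, unsigned char *y32)
{
    __m512i zmm = _mm512_set1_epi8(c);
    __m256i ymm = _mm256_set1_epi8(c);
    _mm512_storeu_si512(z64, zmm);
    _mm256_storeu_si256((__m256i *)y32, ymm);
}

/* Hand-written CSE: broadcast once, then reuse the low 256 bits of zmm. */
AVX512_TARGET
static void fill_one_broadcast(char c, unsigned char *z64, unsigned char *y32)
{
    __m512i zmm = _mm512_set1_epi8(c);
    __m256i ymm = _mm512_castsi512_si256(zmm);  /* reinterprets the low half */
    _mm512_storeu_si512(z64, zmm);
    _mm256_storeu_si256((__m256i *)y32, ymm);
}

/* Returns 1 if both forms store identical bytes, 0 if they differ,
   and -1 if the running CPU lacks AVX512BW or AVX512VL. */
int broadcast_forms_agree(char c)
{
    unsigned char z1[64], y1[32], z2[64], y2[32];

    if (!__builtin_cpu_supports("avx512bw") ||
        !__builtin_cpu_supports("avx512vl"))
        return -1;

    fill_two_broadcasts(c, z1, y1);
    fill_one_broadcast(c, z2, y2);
    return memcmp(z1, z2, sizeof z1) == 0 && memcmp(y1, y2, sizeof y1) == 0;
}
```

Since both forms should store the same bytes, the second is the result one would hope the optimizer reaches from the first. Comparing the generated code for the two fill functions (for example with -O3 -S) is one way to check whether a given compiler version performs the CSE on its own.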