https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636
Bug ID: 80636
Summary: AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
Product: gcc
Version: 8.0
URL: http://stackoverflow.com/questions/43713273/is-vxorps-zeroing-on-amd-jaguar-bulldozer-zen-faster-with-xmm-registers-than-ymm
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Currently, gcc compiles _mm256_setzero_ps() to vxorps %ymm0, %ymm0, %ymm0, or zmm for _mm512_setzero_ps. And similar for pd and integer vectors, using a vector size that matches how it's going to use the register.

vxorps %xmm0, %xmm0, %xmm0 has the same effect, because AVX instructions zero the destination register out to VLMAX.

AMD Ryzen decodes the xmm version to 1 micro-op, but the ymm version to 2 micro-ops. It doesn't detect the zeroing idiom special-case until after the decoder has split it. (Earlier AMD CPUs (Bulldozer/Jaguar) may be similar.)

---

For zeroing a ZMM register, it also saves a byte or two to use a VEX prefix instead of EVEX, if the target register is zmm0-15. (zmm16-31 of course always need EVEX).

---

There is no benefit, but also no downside, to using xmm-zeroing on Intel CPUs that don't split 256b or 512b vector ops. This change could be made across the board, without adding any tuning options to control it.

References:

http://stackoverflow.com/a/43751783/224132
  Agner Fog's answer to my SO question about this.

https://bugs.llvm.org/show_bug.cgi?id=32862
  the same issue for clang.
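
As a rough illustration of the "saves a byte or two" point about VEX vs. EVEX, here is a small sketch that compares encoding lengths. The byte sequences are hand-assembled from the VEX/EVEX prefix layouts, not taken from gcc output, and every identifier is hypothetical. The ZMM case uses vpxord (the AVX512F form); vxorps with zmm operands needs AVX512DQ but has the same length.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-assembled encodings (an assumption of this sketch, not compiler output). */

/* vxorps %xmm0, %xmm0, %xmm0 : 2-byte VEX prefix + opcode + ModRM */
static const unsigned char zero_xmm0_vex[]  = { 0xC5, 0xF8, 0x57, 0xC0 };
/* vxorps %ymm0, %ymm0, %ymm0 : same length, only VEX.L differs */
static const unsigned char zero_ymm0_vex[]  = { 0xC5, 0xFC, 0x57, 0xC0 };
/* vxorps %xmm8, %xmm8, %xmm8 : 3-byte VEX, because ModRM.rm needs VEX.B */
static const unsigned char zero_xmm8_vex[]  = { 0xC4, 0x41, 0x38, 0x57, 0xC0 };
/* vpxord %zmm0, %zmm0, %zmm0 : 4-byte EVEX prefix + opcode + ModRM */
static const unsigned char zero_zmm0_evex[] = { 0x62, 0xF1, 0x7D, 0x48, 0xEF, 0xC0 };
/* vpxord %zmm8, %zmm8, %zmm8 : EVEX is always 4 bytes of prefix */
static const unsigned char zero_zmm8_evex[] = { 0x62, 0x51, 0x3D, 0x48, 0xEF, 0xC0 };

/* Bytes saved by zeroing zmm0-7 through the xmm VEX form. */
size_t saving_low_regs(void)
{
    return sizeof zero_zmm0_evex - sizeof zero_xmm0_vex;
}

/* Bytes saved by zeroing zmm8-15 through the xmm VEX form. */
size_t saving_high_regs(void)
{
    return sizeof zero_zmm8_evex - sizeof zero_xmm8_vex;
}

/* xmm vs. ymm zeroing is size-neutral; the ymm cost on Ryzen is in uops. */
void check_xmm_ymm_same_size(void)
{
    assert(sizeof zero_xmm0_vex == sizeof zero_ymm0_vex);
}
```

Under these encodings the saving is 2 bytes for zmm0-7 and 1 byte for zmm8-15. Switching from ymm to xmm saves no code size, so the motivation there is purely the decode cost on AMD.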