https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122746
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to vekumar from comment #1)
> GCC 16 (1/1/26)
> --Snip--
> .L4:
>         vmovupd (%rsi,%rax), %zmm3
>         vmovupd 64(%rsi,%rax), %zmm2
>         vaddpd  %xmm3, %xmm1, %xmm1
>         vextracti32x4   $1, %zmm3, %xmm4
>         vaddpd  %xmm4, %xmm1, %xmm1
>         vextracti32x4   $2, %zmm3, %xmm4
>         vextracti32x4   $3, %zmm3, %xmm3
>         vaddpd  %xmm4, %xmm1, %xmm1
>         vaddpd  %xmm3, %xmm1, %xmm1
> --Snip--
>
> On Zen4/5 the GCC trunk code is bad: it generates high-latency
> vextracti32x4 (5 cycles) to do the in-order reduction. On these targets,
> wider-to-narrow operations should be costed higher and better avoided.
>
> GCC 15 uses "vinsertf64x2". Inserts are cheaper, and vectorizing at the
> YMM level seems better here.
>
>         vmovsd  (%rdx), %xmm0
>         vmovhpd 8(%rdx), %xmm0, %xmm2   <== This can be optimized to a single load.
>         vmovupd (%rax), %xmm0
>         vinsertf64x2    $0x1, %xmm2, %ymm0, %ymm0
>         vaddpd  %ymm1, %ymm0, %ymm0

This isn't code for the testcase in this bug?
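For reference, an in-order (fold-left) FP reduction of roughly the following shape is the kind of loop that yields the lane-extract-and-add chain quoted above when compiled without -ffast-math at something like -O3 -march=znver4. This is only an illustrative sketch, not necessarily the testcase attached to this PR.

        /* Illustrative only -- not necessarily the testcase of this PR.
           Strict FP semantics (no -ffast-math) force the vectorizer to
           accumulate the lanes of each loaded vector serially (a fold-left
           reduction), which is what the vextracti32x4/vaddpd chain above
           implements on x86.  */
        double
        sum (const double *a, int n)
        {
          double s = 0.0;
          for (int i = 0; i < n; i++)
            s += a[i];
          return s;
        }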
