https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122746
vekumar at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |vekumar at gcc dot gnu.org
--- Comment #1 from vekumar at gcc dot gnu.org ---
GCC 16 trunk (1/1/26):
--Snip--
.L4:
vmovupd (%rsi,%rax), %zmm3
vmovupd 64(%rsi,%rax), %zmm2
vaddpd %xmm3, %xmm1, %xmm1
vextracti32x4 $1, %zmm3, %xmm4
vaddpd %xmm4, %xmm1, %xmm1
vextracti32x4 $2, %zmm3, %xmm4
vextracti32x4 $3, %zmm3, %xmm3
vaddpd %xmm4, %xmm1, %xmm1
vaddpd %xmm3, %xmm1, %xmm1
--Snip--
On Zen4/Zen5, the code GCC trunk generates is poor: it uses high-latency
vextracti32x4 (5 cycles) to perform the in-order reduction. On these targets,
wide-to-narrow extract operations should be costed higher and avoided where
possible.
GCC 15 uses "insertf64x2". Inserts are cheaper and vectorizing at YMM level
seems better here.
vmovsd (%rdx), %xmm0
vmovhpd 8(%rdx), %xmm0, %xmm2   <== This can be optimized to a single load.
vmovupd (%rax), %xmm0
vinsertf64x2 $0x1, %xmm2, %ymm0, %ymm0
vaddpd %ymm1, %ymm0, %ymm0
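For comparison, the GCC 15 shape, with the two scalar loads folded into the
single 128-bit load suggested above, looks roughly like this in intrinsics
(again illustrative only; the names are made up, and the pointers stand for
whatever addresses %rdx and %rax hold in the snippet). The 128-bit insert
maps to vinsertf128/vinsertf64x2, so the lane combine is a single cheap insert
rather than a chain of 5-cycle extracts:

#include <immintrin.h>

/* Illustrative sketch of the GCC 15 pattern: build a YMM value from two
   128-bit halves with an insert and accumulate at YMM width.  */
static inline __m256d
accumulate_ymm_by_insert (const double *p, const double *q, __m256d acc)
{
  __m128d lo = _mm_loadu_pd (p);   /* vmovupd (%rax), %xmm0                  */
  __m128d hi = _mm_loadu_pd (q);   /* single load replacing vmovsd + vmovhpd */
  __m256d v  = _mm256_insertf128_pd (_mm256_castpd128_pd256 (lo), hi, 1);
  return _mm256_add_pd (acc, v);   /* vaddpd %ymm1, %ymm0, %ymm0             */
}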