https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122746

vekumar at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vekumar at gcc dot gnu.org

--- Comment #1 from vekumar at gcc dot gnu.org ---
GCC 16 (1/1/26)
--Snip--
.L4:
        vmovupd (%rsi,%rax), %zmm3
        vmovupd 64(%rsi,%rax), %zmm2
        vaddpd  %xmm3, %xmm1, %xmm1
        vextracti32x4   $1, %zmm3, %xmm4
        vaddpd  %xmm4, %xmm1, %xmm1
        vextracti32x4   $2, %zmm3, %xmm4
        vextracti32x4   $3, %zmm3, %xmm3
        vaddpd  %xmm4, %xmm1, %xmm1
        vaddpd  %xmm3, %xmm1, %xmm1
--Snip--        

On Zen4/5, the code generated by GCC trunk is poor: it uses the high-latency
vextracti32x4 (5 cycles) to perform the in-order reduction. On these targets,
wide-to-narrow extract operations should be costed higher and avoided where
possible.
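
The test case is not quoted in this comment, but the general shape that
produces this in-order, 128-bit-chunk reduction is a strict (no -ffast-math)
pair-wise FP sum along the lines of the sketch below; the function name, loop
shape and flags are only illustrative assumptions, not the PR's actual
reproducer:

/* Hypothetical sketch, not the PR's test case.  Without -ffast-math the
   FP additions cannot be reassociated, so the vectorizer reduces the
   loaded vector in order, one 128-bit chunk (one re/im pair) at a time,
   which is where the vextracti32x4 chain above comes from.
   Built with something like: gcc -O3 -march=znver5  */
void sum_pairs (const double *a, long n, double *out)
{
  double re = 0.0, im = 0.0;
  for (long i = 0; i < n; i++)
    {
      re += a[2 * i];       /* strict, in-order FP adds */
      im += a[2 * i + 1];
    }
  out[0] = re;
  out[1] = im;
}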

GCC 15 uses "vinsertf64x2" instead. Inserts are cheaper, and vectorizing at
the YMM level seems better here.

vmovsd  (%rdx), %xmm0
vmovhpd 8(%rdx), %xmm0, %xmm2           <== These two loads can be combined into a single 16-byte load.
vmovupd (%rax), %xmm0
vinsertf64x2    $0x1, %xmm2, %ymm0, %ymm0
vaddpd  %ymm1, %ymm0, %ymm0
