https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120398

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122746
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2026-02-16
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
This could be vectorized with an in-order SLP reduction.  The issue with GCC
15/16 is that we can now use larger vectors, but inefficiently, and that x86
does not compare vector costs, which would have made it choose the 8 byte
vector version from GCC 14.  Mine for that part; you can see the effect with
--param ix86-vect-compare-costs=1 now:

> ./cc1 -quiet t.c -O2 -fopt-info-vec --param ix86-vect-compare-costs=1
t.c:8:12: optimized: loop vectorized using 8 byte vectors and unroll factor 1

.L3:
        movq    (%rdi,%rax,8), %xmm0
        movq    %xmm1, %xmm1
        addq    $1, %rax
        mulps   %xmm0, %xmm0
        movq    %xmm0, %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, %xmm1
        cmpq    %rax, %rsi
        jne     .L3
        shufps  $0xe5, %xmm0, %xmm0
        addss   %xmm1, %xmm0
        ret

so there's a workaround for GCC 16.
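
For readers without the testcase at hand: t.c is not quoted in this comment, so
the following is only a guess at the shape of loop involved, reconstructed from
the assembly above (two floats loaded per iteration, squared, accumulated per
lane, and the two lanes added after the loop).  The function name and signature
are hypothetical, not taken from the bug report.

```c
#include <stddef.h>

/* Hypothetical stand-in for t.c (an assumption, not the reporter's code):
   two independent float accumulators over interleaved pairs.  Each
   accumulator is summed in source order, so no reassociation is needed
   to keep both in the lanes of one 8 byte vector.  */
float sum_sq_pairs(const float *p, size_t n)
{
    float s0 = 0.0f;  /* accumulates squares of even-indexed elements */
    float s1 = 0.0f;  /* accumulates squares of odd-indexed elements */

    for (size_t i = 0; i < n; i++) {
        s0 += p[2 * i] * p[2 * i];
        s1 += p[2 * i + 1] * p[2 * i + 1];
    }

    /* Final cross-lane add, corresponding to the shufps/addss epilogue.  */
    return s0 + s1;
}
```

Compiling a file like this with the command line quoted above (-O2
-fopt-info-vec --param ix86-vect-compare-costs=1) is one way to check which
vector size gets picked; whether it reproduces the exact output shown here
depends on how close the sketch is to the real testcase.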
