https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2023-06-02 00:00:00         |2026-3-27

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'll note that cost comparison should make this easier to improve, but in the
end BB vectorization would be the most profitable.  With
-fno-tree-loop-optimize -mavx2 we get

.L4:
        vmovss  (%rdx), %xmm2
        vpinsrw $0, (%rax), %xmm0, %xmm0
        movzbl  2(%rax), %ecx
        addq    $4, %rdx
        vpmovzxbd       %xmm0, %xmm0
        addq    $4, %rax
        vmovq   %xmm0, %xmm0
        vbroadcastss    %xmm2, %xmm4
        vcvtdq2ps       %xmm0, %xmm0
        vmulps  %xmm4, %xmm0, %xmm0
        vaddps  %xmm1, %xmm0, %xmm1
        vcvtsi2ssl      %ecx, %xmm5, %xmm0
        vmulss  %xmm2, %xmm0, %xmm0
        vaddss  %xmm0, %xmm3, %xmm3
        cmpq    %rdx, %rsi
        jne     .L4

(I'll note the testcase leaves 'pixel' uninitialized.)  The code above is only
missing the use of FMAs.
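
For readers without the testcase at hand, the loop shape implied by the
assembly above can be sketched roughly as follows.  This is a guess from the
generated code, not the testcase attached to the bug: the function name
accumulate_rgb, the parameter names and the 4-byte pixel stride are all
assumptions.  The sketch zero-initializes 'pixel' so that the accumulation is
well defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the loop shape; all names are assumptions.
   Each source pixel occupies 4 bytes, of which the first 3 are accumulated,
   each scaled by one float weight per pixel.  */
void
accumulate_rgb (const uint8_t *src, const float *weight, size_t n,
                float pixel[3])
{
  /* Initialize the accumulators explicitly so the sums are well defined.  */
  pixel[0] = pixel[1] = pixel[2] = 0.0f;

  for (size_t i = 0; i < n; ++i)
    {
      float w = weight[i];
      /* Three isomorphic multiply-adds per iteration: a candidate for SLP
         vectorization, and each one is a candidate for an FMA.  */
      pixel[0] += w * (float) src[4 * i + 0];
      pixel[1] += w * (float) src[4 * i + 1];
      pixel[2] += w * (float) src[4 * i + 2];
    }
}
```

In the assembly above, two of the three channels go through the vector
multiply/add pair and the third through the scalar pair.  Note that -mavx2
does not by itself enable FMA instructions; presumably something like -mfma
(or a suitable -march) would be needed before the multiply/add pairs could be
contracted.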
