https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2023-06-02 00:00:00 |2026-3-27
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'll note that cost comparison should make this easier to improve, but in the
end it is BB vectorization that would be most profitable. With
-fno-tree-loop-optimize -mavx2 we get
.L4:
        vmovss  (%rdx), %xmm2
        vpinsrw $0, (%rax), %xmm0, %xmm0
        movzbl  2(%rax), %ecx
        addq    $4, %rdx
        vpmovzxbd       %xmm0, %xmm0
        addq    $4, %rax
        vmovq   %xmm0, %xmm0
        vbroadcastss    %xmm2, %xmm4
        vcvtdq2ps       %xmm0, %xmm0
        vmulps  %xmm4, %xmm0, %xmm0
        vaddps  %xmm1, %xmm0, %xmm1
        vcvtsi2ssl      %ecx, %xmm5, %xmm0
        vmulss  %xmm2, %xmm0, %xmm0
        vaddss  %xmm0, %xmm3, %xmm3
        cmpq    %rdx, %rsi
        jne     .L4
(I'll note the testcase has an uninitialized 'pixel'); the code above is only
missing the use of FMAs.
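
For reference, the kind of loop that would produce the code above can be
sketched in C roughly as follows. This is a guess reconstructed from the
assembly, not the testcase attached to the bug: the names pixel_u8, pixel_f32
and weighted_sum are made up, and the 4-byte pixel with three accumulated
channels is only inferred from the 4-byte stride on %rax and the byte loads at
offsets 0 to 2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-byte source pixel (assumption: inferred from addq $4, %rax). */
struct pixel_u8 { unsigned char red, green, blue, opacity; };

/* Hypothetical float accumulator for the three channels that are summed. */
struct pixel_f32 { float red, green, blue; };

/* Sum the three colour channels of n pixels, each weighted by kernel[i]. */
struct pixel_f32 weighted_sum(const struct pixel_u8 *src,
                              const float *kernel, size_t n)
{
    /* Explicitly zeroed, unlike the uninitialized 'pixel' noted above. */
    struct pixel_f32 acc = { 0.0f, 0.0f, 0.0f };

    assert(n == 0 || (src != NULL && kernel != NULL));

    for (size_t i = 0; i < n; i++) {
        float w = kernel[i];        /* one float weight per pixel */
        acc.red   += w * src[i].red;    /* bytes 0 and 1: the packed    */
        acc.green += w * src[i].green;  /* vmulps/vaddps pair above     */
        acc.blue  += w * src[i].blue;   /* byte 2: scalar vmulss/vaddss */
    }
    return acc;
}
```

Each statement in the loop body is a multiply followed by an add into an
accumulator, which is the pattern an FMA would cover. Note that -mavx2 on its
own does not enable FMA instructions; -mfma (or a -march setting that includes
it) would be needed before the vmulps/vaddps and vmulss/vaddss pairs could be
contracted.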