https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979

--- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
A quick prototype for comment #11 now has

t2.c:8:10: note: Cost model analysis for part in loop 0:
  Vector cost: 156
  Scalar cost: 184

doing the following with -O2 -mfma:

foo:
.LFB0:
        .cfi_startproc
        subq    $24, %rsp
        .cfi_def_cfa_offset 32
        vmovq   (%rdi), %xmm1
        vmovq   (%rsi), %xmm2
        vmovshdup       %xmm1, %xmm3
        vmovsldup       %xmm1, %xmm0
        vshufps $0xe1, %xmm2, %xmm2, %xmm4
        vmovq   %xmm4, %xmm4
        vmovq   %xmm3, %xmm3
        vmovq   %xmm0, %xmm0
        vmulps  %xmm4, %xmm3, %xmm3
        vmovq   %xmm2, %xmm4
        vmovq   %xmm3, %xmm3
        vfmaddsub132ps  %xmm4, %xmm3, %xmm0
        vmovaps %xmm0, %xmm3
        vmovshdup       %xmm0, %xmm0
        vucomiss        %xmm0, %xmm3
        jp      .L5
.L2:
        vmovshdup       %xmm3, %xmm5
        vmovss  %xmm3, 8(%rsp)
        vmovss  %xmm5, 12(%rsp)
        vmovq   8(%rsp), %xmm0
        addq    $24, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
.L5:
        .cfi_restore_state
        vmovaps %xmm1, %xmm0
        vmovshdup       %xmm2, %xmm3
        vmovshdup       %xmm1, %xmm1
        call    __mulsc3
        vmovdqa %xmm0, %xmm3
        vshufps $85, %xmm0, %xmm0, %xmm0
        vunpcklps       %xmm0, %xmm3, %xmm3
        jmp     .L2

It shows we now cost vector FMADDSUB (12) but do not anticipate scalar
FMADD/FMSUB use, over-costing the scalar side (2*16 + 2*12).  The live
lane extractions are cheap, but the original scalar code might still
be considered better:

foo:
.LFB0:
        .cfi_startproc
        subq    $24, %rsp
        .cfi_def_cfa_offset 32
        vmovss  4(%rdi), %xmm1
        vmovss  (%rsi), %xmm2
        vmovss  4(%rsi), %xmm3
        vmovss  (%rdi), %xmm5
        vmulss  %xmm2, %xmm1, %xmm0
        vmulss  %xmm3, %xmm1, %xmm4
        vfmadd231ss     %xmm3, %xmm5, %xmm0
        vfmsub231ss     %xmm2, %xmm5, %xmm4
        vucomiss        %xmm0, %xmm4
        jp      .L5
.L2:
        vmovss  %xmm4, 8(%rsp)
        vmovss  %xmm0, 12(%rsp)
        vmovq   8(%rsp), %xmm0
        addq    $24, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
.L5:
        .cfi_restore_state
        vmovaps %xmm5, %xmm0
        call    __mulsc3
        vmovdqa %xmm0, %xmm4
        vshufps $85, %xmm0, %xmm0, %xmm0
        jmp     .L2

Even with fast complex we get:

foo:
.LFB0:
        .cfi_startproc
        vmovq   (%rdi), %xmm0
        vmovq   (%rsi), %xmm2
        vmovsldup       %xmm0, %xmm1
        vmovshdup       %xmm0, %xmm0
        vshufps $0xe1, %xmm2, %xmm2, %xmm3
        vmulps  %xmm3, %xmm0, %xmm0
        vfmaddsub231ps  %xmm2, %xmm1, %xmm0
        vmovshdup       %xmm0, %xmm4
        vmovss  %xmm0, -8(%rsp)
        vmovss  %xmm4, -4(%rsp)
        vmovq   -8(%rsp), %xmm0
        ret

Possibly, with AVX512, an embedded broadcast from memory would be better
for the two splats.


For the testcase in the description, unpatched trunk emits unvectorized
FMAs in both the outline and inline copies if you do not name the function
main; otherwise we optimize the inline copy for size, emitting only a
libcall.

So I think the original report was fixed at some point in GCC 15, whatever
"always" means.  With GCC 14 I can still see vectorization, but with
vaddsubps instead of FMA.  With GCC 15 and -fno-vect-cost-model I see
vfmaddsub132ps used in mul but not in the renamed main, where vfmadd231ss
and vfmsub231ss are used.  That missed optimization to vectorize in main()
(renamed as foo) remains with my cost-model patch and is due to the
missing SLP vectorization root there; 'mul' has an effective store to
serve as one.
