https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86029
ktkachov at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |ktkachov at gcc dot gnu.org
--- Comment #2 from ktkachov at gcc dot gnu.org ---
(In reply to Tavian Barnes from comment #1)
> Maybe a dupe of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70291? In the
> -O3 version, __mulsc3() dominates the profile.
>
> │ for(int i=0; i<=decimate_taps_length; i++) decim += samplebuf[i] * decimate_taps[i];
> 0.20 │430:┌─→vmovss 0x4(%r13,%rbx,1),%xmm1
> 3.63 │ │ vmovss 0x0(%r13,%rbx,1),%xmm0
> 12.35 │ │ vmovss 0x4(%r12,%rbx,1),%xmm3
> 0.31 │ │ vmovss (%r12,%rbx,1),%xmm2
> 0.02 │ │ add $0x8,%rbx
> 36.48 │ │→ callq __mulsc3
> 0.01 │ │ vmovss -0x78(%rbp),%xmm6
> 0.00 │ │ vmovss -0x80(%rbp),%xmm4
> 23.70 │ │ vmovq %xmm0,-0x68(%rbp)
> 14.25 │ │ vaddss -0x68(%rbp),%xmm6,%xmm5
> 1.54 │ │ vaddss -0x64(%rbp),%xmm4,%xmm7
> 0.48 │ │ vmovss %xmm5,-0x78(%rbp)
> 5.92 │ │ vmovss %xmm7,-0x80(%rbp)
> │ ├──cmp $0x2590,%rbx
> 0.01 │ └──jne 430
>
> At -Ofast,
>
> │ for(int i=0; i<=decimate_taps_length; i++) decim += samplebuf[i] * decimate_taps[i];
> 9.36 │5e0: vpermilps $0xf5,(%r12,%rax,1),%ymm0
> 15.56 │ vpermilps $0xa0,(%r12,%rax,1),%ymm1
> 11.24 │ vmulps (%rbx,%rax,1),%ymm0,%ymm0
> 17.55 │ vpermilps $0xb1,(%rbx,%rax,1),%ymm4
> 3.31 │ add $0x20,%rax
> 2.11 │ vmovaps %ymm1,%ymm3
> 6.62 │ vfmadd132ps %ymm4,%ymm0,%ymm3
> 3.79 │ vfmsub231ps %ymm4,%ymm1,%ymm0
> 2.91 │ vblendps $0xaa,%ymm0,%ymm3,%ymm0
> 10.75 │ vaddps %ymm0,%ymm6,%ymm6
> │ cmp $0x2580,%rax
> 5.59 │ ↑ jne 5e0
> 0.01 │ vmovss 0x258c(%rbx),%xmm0
> 0.01 │ vmovss -0x70(%rbp),%xmm7
> 0.01 │ vmovss %xmm5,-0xd0(%rbp)
> 0.05 │ vextractf128 $0x1,%ymm6,%xmm3
> 0.01 │ vmovss 0x2588(%rbx),%xmm8
> 0.03 │ vshufps $0xff,%xmm3,%xmm3,%xmm13
It looks like it. Could you try this out with current trunk?
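
For anyone trying to reproduce this, a minimal reduction of the quoted loop might look like the sketch below. It is an assumption, not the reporter's code: the element type (float complex), the function names dot_builtin and dot_manual, and the parameter names are all hypothetical. Without -fcx-limited-range (which -ffast-math, and therefore -Ofast, turns on), GCC has to honour the C99 Annex G rules for NaN and infinity in complex multiplication, and depending on the GCC version that can mean a call to __mulsc3 on every iteration. The second function writes the multiplication out by hand, which is roughly what the limited-range mode computes.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical reduction of the quoted loop; all names are illustrative. */
float complex dot_builtin(const float complex *samplebuf,
                          const float complex *taps, size_t n)
{
    float complex acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += samplebuf[i] * taps[i];  /* may become a __mulsc3 call */
    return acc;
}

/* Same sum with the multiplication spelled out on the real and
 * imaginary parts. There is no NaN/infinity recovery step here, so
 * results can differ from dot_builtin for non-finite inputs. */
float complex dot_manual(const float complex *samplebuf,
                         const float complex *taps, size_t n)
{
    float re = 0.0f, im = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = crealf(samplebuf[i]), b = cimagf(samplebuf[i]);
        float c = crealf(taps[i]),      d = cimagf(taps[i]);
        re += a * c - b * d;  /* real part of (a+bi)(c+di) */
        im += a * d + b * c;  /* imaginary part */
    }
    return re + im * I;
}
```

Comparing the assembly for dot_builtin at -O3 with and without -fcx-limited-range should show whether the libcall is what separates the two profiles.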