On Mon, 7 Nov 2011, Justin Ruggles wrote:

> +.loop:
> +    movu            m1, [v1q+offsetq]
> +    mulps           m1, m1, [v2q+offsetq]
> +    addps           m0, m0, m1
> +    add        offsetq, mmsize
>      js           .loop

addps has a latency of 3 or 4 cycles, whereas the loop should be only 1
or 2 cycles per iteration just counting uops. Thus it's latency bound
and could be improved by using multiple accumulators.

> +%if cpuflag(avx)
> +    vextractf128  xmm0, ymm0, 0

Does this work? The docs say that (like any VEX-encoded op)
vextractf128 to an xmm register clobbers the upper half of the
corresponding ymm. And it's unnecessary anyway: xmm0 is already the
lower half of ymm0.

> +    vextractf128  xmm1, ymm0, 1
> +    addps         xmm0, xmm1
> +%endif
> +%if cpuflag(sse3)
> +    haddps        xmm0, xmm0
> +    haddps        xmm0, xmm0

Is this really an improvement? How about pshuflw?

> +%else
>      movhlps xmm1, xmm0
>      addps   xmm0, xmm1
>      movss   xmm1, xmm0
>      shufps  xmm0, xmm0, 1
>      addss   xmm0, xmm1
> +%endif
>  %ifndef ARCH_X86_64
>      movd    r0m,  xmm0
>      fld     dword r0m
>  %endif
>      RET
> +%endmacro

Does this need a vzeroupper?

--Loren Merritt
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
