On Mon, 7 Nov 2011, Justin Ruggles wrote:

> +.loop:
> +    movu    m1, [v1q+offsetq]
> +    mulps   m1, m1, [v2q+offsetq]
> +    addps   m0, m0, m1
> +    add     offsetq, mmsize
>      js .loop
addps has a latency of 3 or 4, whereas the loop should be 1 or 2 cycles per
iteration just counting uops. Thus it's latency-bound and could be improved
by using multiple accumulators.

> +%if cpuflag(avx)
> +    vextractf128 xmm0, ymm0, 0

Does this work? The docs say that (like any VEX op) vextractf128 to xmm
clobbers the upper half of the corresponding ymm. And it's unnecessary:
xmm0 is already the lower half of ymm0.

> +    vextractf128 xmm1, ymm0, 1
> +    addps   xmm0, xmm1
> +%endif
> +%if cpuflag(sse3)
> +    haddps  xmm0, xmm0
> +    haddps  xmm0, xmm0

Is this really an improvement? How about pshuflw?

> +%else
>      movhlps xmm1, xmm0
>      addps   xmm0, xmm1
>      movss   xmm1, xmm0
>      shufps  xmm0, xmm0, 1
>      addss   xmm0, xmm1
> +%endif
>  %ifndef ARCH_X86_64
>      movd    r0m, xmm0
>      fld     dword r0m
>  %endif
>      RET
> +%endmacro

Does this need a vzeroupper?

--Loren Merritt
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
