On 11/07/2011 06:26 PM, Loren Merritt wrote:
> On Mon, 7 Nov 2011, Justin Ruggles wrote:
>
>> +.loop:
>> +    movu    m1, [v1q+offsetq]
>> +    mulps   m1, m1, [v2q+offsetq]
>> +    addps   m0, m0, m1
>> +    add     offsetq, mmsize
>>      js .loop
>
> addps had latency 3 or 4, whereas the loop should be 1 or 2 cycles per
> iteration just counting uops. Thus it's latency bound and could be
> improved by multiple accumulators.
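
To illustrate the multiple-accumulator suggestion quoted above, here is a
minimal sketch in plain C rather than the asm from the patch. The function
names and the choice of four accumulators are assumptions made for
illustration only; the useful number of accumulators depends on the add
latency and throughput of the target CPU. Note that splitting the sum
reorders the float additions, so the result can differ from the
single-accumulator version in the last bits.

```c
#include <stddef.h>

/* Hypothetical reference: a single accumulator. Every add depends on the
 * previous one, so the loop is limited by the latency of the add. */
static float scalarproduct_1acc(const float *v1, const float *v2, size_t len)
{
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++)
        sum += v1[i] * v2[i];
    return sum;
}

/* Hypothetical variant: four independent accumulators. The four adds in
 * each iteration do not depend on each other, so they can overlap in the
 * pipeline. The accumulators are combined once after the main loop. */
static float scalarproduct_4acc(const float *v1, const float *v2, size_t len)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= len; i += 4) {
        s0 += v1[i + 0] * v2[i + 0];
        s1 += v1[i + 1] * v2[i + 1];
        s2 += v1[i + 2] * v2[i + 2];
        s3 += v1[i + 3] * v2[i + 3];
    }

    /* Combine the partial sums. */
    float sum = (s0 + s1) + (s2 + s3);

    /* Tail: remaining 0..3 elements when len is not a multiple of 4. */
    for (; i < len; i++)
        sum += v1[i] * v2[i];

    return sum;
}
```

In the asm loop the same idea would mean keeping several accumulator
registers alive instead of only m0 and adding them together after the loop.
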

I just realized that the only current use of this function is in aacdec,
which requires it to work with lengths that are a multiple of 4. I couldn't
even find a sample that triggers the function (I had to insert a dummy call
to test it). So I'll drop the AVX part for now.

I have another use for this function in the AC-3 encoder (per-band energy
calculation), but that requires it to work with both unaligned input and
arbitrary lengths. So I'll put that on my TODO list and revisit the AVX
part at that time.

-Justin
_______________________________________________
libav-devel mailing list
https://lists.libav.org/mailman/listinfo/libav-devel
