On 06/25/2012 12:21 PM, Justin Ruggles wrote:
> On 06/25/2012 04:26 AM, Loren Merritt wrote:
>> mix_n_to_1 avx/fma4: still have a load that could be a memory arg
> 
> I'm sorry, but I don't see it.
> 
>> mix_2_to_1_fltp_flt: lacks fma4
> 
> That will be addressed separately after I have another look at the
> fmaddps patch.
> 
>> mix_2_to_1_s16p_q8: use pmaddwd as a madd, not just as a mul.
> 
> Ok, I'll also look at that separately.
> 
>> Is anything latency-bottlenecked? addps is 3/1 on intel and 5/1 on
>> bulldozer, so for max throughput you'd need 3 or 5 independent adds close
>> enough together for out of order execution to notice them. OOOE does apply
>> across multiple loop iterations, but the lookahead might not be big enough
>> to hold 5 iterations of 8_to_1. Multiple accumulators would help this at
>> the cost of register pressure.
> 
> I suppose it depends on the cost of running out of registers. For
> stereo, this would basically move more of the matrix coefficients to the
> stack. For mono, it can certainly afford the additional register usage
> though. I can test the SSE version on atom and see if using multiple
> accumulators helps at all.

On Athlon64, for SSE float, 6_to_1 was slightly faster with multiple
accumulators. 7_to_1 was about the same. All others were slower.

On Sandy Bridge, for AVX float, 7_to_1 was slightly faster with multiple
accumulators. All others were slower.

Because the differences are so small, I don't think the extra code and
complexity needed to use multiple accumulators only in those specific
cases is worth it.
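
For anyone following along, here is a rough scalar sketch in C of what
"multiple accumulators" means here. It is only an illustration, not the
actual asm: the function names, signatures, and the choice of two
accumulators are hypothetical, and the real functions work on SIMD
registers. The first version has one dependent chain of adds per output
sample; the second splits the sum over two independent chains and
combines them at the end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of an N-to-1 mix:
 * dst[i] = sum over ch of src[ch][i] * coeff[ch].
 * Single accumulator: every add depends on the previous one. */
void mix_n_to_1_single(float *dst, const float *const *src,
                       const float *coeff, int channels, size_t len)
{
    assert(channels > 0);
    for (size_t i = 0; i < len; i++) {
        float acc = 0.0f;
        for (int ch = 0; ch < channels; ch++)
            acc += src[ch][i] * coeff[ch];
        dst[i] = acc;
    }
}

/* Two accumulators: even and odd channel pairs feed separate chains,
 * so two adds can be in flight at once. The cost is one extra live
 * register plus a final add to combine the partial sums. */
void mix_n_to_1_dual(float *dst, const float *const *src,
                     const float *coeff, int channels, size_t len)
{
    assert(channels > 0);
    for (size_t i = 0; i < len; i++) {
        float acc0 = 0.0f, acc1 = 0.0f;
        int ch = 0;
        for (; ch + 1 < channels; ch += 2) {
            acc0 += src[ch][i]     * coeff[ch];
            acc1 += src[ch + 1][i] * coeff[ch + 1];
        }
        if (ch < channels)          /* leftover channel for odd counts */
            acc0 += src[ch][i] * coeff[ch];
        dst[i] = acc0 + acc1;
    }
}
```

Note that the two versions add in a different order, so the float
results can differ in the last bit for general input. Whether the extra
chain helps depends on add latency and on how many registers are left
for the coefficients, which is what the measurements above were testing.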

-Justin