On 06/25/2012 12:21 PM, Justin Ruggles wrote:
> On 06/25/2012 04:26 AM, Loren Merritt wrote:
>> mix_n_to_1 avx/fma4: still have a load that could be a memory arg
>
> I'm sorry, but I don't see it.
>
>> mix_2_to_1_fltp_flt: lacks fma4
>
> That will be addressed separately after I have another look at the
> fmaddps patch.
>
>> mix_2_to_1_s16p_q8: use pmaddwd as a madd, not just as a mul.
>
> Ok, I'll also look at that separately.
>
>> Is anything latency-bottlenecked? addps is 3/1 on intel and 5/1 on
>> bulldozer, so for max throughput you'd need 3 or 5 independent adds close
>> enough together for out of order execution to notice them. OOOE does apply
>> across multiple loop iterations, but the lookahead might not be big enough
>> to hold 5 iterations of 8_to_1. Multiple accumulators would help this at
>> the cost of register pressure.
>
> I suppose it depends on the cost of running out of registers. For
> stereo, this would basically move more of the matrix coefficients to the
> stack. For mono, it can certainly afford the additional register usage
> though. I can test the SSE version on atom and see if using multiple
> accumulators helps at all.

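As an illustration of the multiple-accumulator idea raised in the quoted text, here is a minimal scalar C sketch. It is not the actual assembly under review: the function names mix_6_to_1_single and mix_6_to_1_dual, the IN_CH constant, and the planar src layout are all assumptions made up for this example. The first version chains every add through one accumulator, so each add has to wait for the previous one. The second splits the sum into two independent chains, which gives out-of-order execution more adds to overlap, at the cost of one more live register.

```c
#include <stddef.h>

/* Hypothetical channel count for this sketch (a "6_to_1" mix). */
#define IN_CH 6

/* Single accumulator: each add depends on the result of the previous add,
 * so the adds form one serial dependency chain per output sample. */
static void mix_6_to_1_single(float *dst, const float *const src[IN_CH],
                              const float coeff[IN_CH], size_t len)
{
    for (size_t i = 0; i < len; i++) {
        float acc = src[0][i] * coeff[0];
        for (int ch = 1; ch < IN_CH; ch++)
            acc += src[ch][i] * coeff[ch];
        dst[i] = acc;
    }
}

/* Two accumulators: even and odd channels are summed in separate chains
 * that do not depend on each other, then combined with one final add. */
static void mix_6_to_1_dual(float *dst, const float *const src[IN_CH],
                            const float coeff[IN_CH], size_t len)
{
    for (size_t i = 0; i < len; i++) {
        float acc0 = src[0][i] * coeff[0];
        float acc1 = src[1][i] * coeff[1];
        acc0 += src[2][i] * coeff[2];
        acc1 += src[3][i] * coeff[3];
        acc0 += src[4][i] * coeff[4];
        acc1 += src[5][i] * coeff[5];
        dst[i] = acc0 + acc1;
    }
}
```

Splitting the sum changes the order of the floating-point additions, so for general inputs the two versions can differ in the last bits. Whether the shorter dependency chains pay for the extra register is something only measurement on the target CPU can show.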
On Athlon64, for SSE float, 6_to_1 was slightly faster with multiple
accumulators. 7_to_1 was about the same. All others were slower.

On Sandy Bridge, for AVX float, 7_to_1 was slightly faster with multiple
accumulators. All others were slower.

Because the difference is so small, I think the extra code and
complexity to only use it in those specific cases is not worth it.

-Justin
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel