On 06/25/2012 04:26 AM, Loren Merritt wrote:
> mix_n_to_1 avx/fma4: still have a load that could be a memory arg

I'm sorry, but I don't see it.

> mix_2_to_1_fltp_flt: lacks fma4

That will be addressed separately after I have another look at the fmaddps
patch.

> mix_2_to_1_s16p_q8: use pmaddwd as a madd, not just as a mul.

Ok, I'll also look at that separately.

> Is anything latency-bottlenecked? addps is 3/1 on intel and 5/1 on
> bulldozer, so for max throughput you'd need 3 or 5 independent adds close
> enough together for out of order execution to notice them. OOOE does apply
> across multiple loop iterations, but the lookahead might not be big enough
> to hold 5 iterations of 8_to_1. Multiple accumulators would help this at
> the cost of register pressure.

I suppose it depends on the cost of running out of registers. For stereo,
this would basically move more of the matrix coefficients to the stack. For
mono, it can certainly afford the additional register usage though. I can
test the SSE version on atom and see if using multiple accumulators helps
at all.

-Justin
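
P.S. For reference, the multiple-accumulator idea discussed above can be
sketched in scalar C. This is only an illustration, not the actual asm: the
function names, signatures, and the choice of two accumulators are
assumptions made for the example. The first version has a single dependency
chain of adds per output sample; the second splits even and odd channels
into two independent chains and combines them at the end.

```c
#include <stddef.h>

/* Hypothetical scalar stand-in for an n-to-1 mix with one accumulator:
 * every add depends on the result of the previous add. */
static void mix_n_to_1_one_acc(float *dst, const float *const *src,
                               const float *coeff, int channels, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        float sum = 0.0f;
        for (int ch = 0; ch < channels; ch++)
            sum += src[ch][i] * coeff[ch];
        dst[i] = sum;
    }
}

/* Same mix with two accumulators (an assumed count, chosen for brevity):
 * the sum0 and sum1 chains do not depend on each other, so their adds can
 * overlap, at the cost of one extra register. */
static void mix_n_to_1_two_acc(float *dst, const float *const *src,
                               const float *coeff, int channels, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        float sum0 = 0.0f, sum1 = 0.0f;
        int ch;
        for (ch = 0; ch + 1 < channels; ch += 2) {
            sum0 += src[ch][i]     * coeff[ch];
            sum1 += src[ch + 1][i] * coeff[ch + 1];
        }
        if (ch < channels)              /* odd channel count: one left over */
            sum0 += src[ch][i] * coeff[ch];
        dst[i] = sum0 + sum1;           /* combine the two chains */
    }
}
```

Note that splitting the sum changes the order of the float additions, so
results can differ from the single-accumulator version in the last bits.
Whether the extra register pays off would still have to be measured.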