On 06/25/2012 04:26 AM, Loren Merritt wrote:
> mix_n_to_1 avx/fma4: still have a load that could be a memory arg

I'm sorry, but I don't see it.

> mix_2_to_1_fltp_flt: lacks fma4

That will be addressed separately after I have another look at the
fmaddps patch.

> mix_2_to_1_s16p_q8: use pmaddwd as a madd, not just as a mul.

Ok, I'll also look at that separately.
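
For reference, one reading of the pmaddwd suggestion is sketched below as a scalar C model. It is not taken from the patch: the function names are hypothetical, the Q8 scaling is modelled as a plain right shift by 8, and rounding and saturation are left out. pmaddwd multiplies adjacent signed 16-bit pairs and sums each pair into a 32-bit lane, so interleaving the two channels' samples (and the two coefficients) lets one instruction do both multiplies and the add.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of pmaddwd: two signed 16x16 products
 * summed into a single 32-bit result. */
static int32_t pmaddwd_lane(int16_t a0, int16_t a1, int16_t b0, int16_t b1)
{
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* "Mul only" use (hypothetical): the second word of each pair is zero, so
 * each pmaddwd yields one product and a separate add is still needed. */
int16_t mix_2_to_1_mul_only(int16_t s0, int16_t s1, int16_t c0, int16_t c1)
{
    int32_t p0 = pmaddwd_lane(s0, 0, c0, 0);
    int32_t p1 = pmaddwd_lane(s1, 0, c1, 0);
    return (int16_t)((p0 + p1) >> 8);   /* Q8 coefficients: drop 8 fraction bits */
}

/* "Madd" use (hypothetical): samples and coefficients are interleaved, so a
 * single pmaddwd produces s0*c0 + s1*c1 directly. */
int16_t mix_2_to_1_madd(int16_t s0, int16_t s1, int16_t c0, int16_t c1)
{
    return (int16_t)(pmaddwd_lane(s0, s1, c0, c1) >> 8);
}
```

Both functions return the same value; the second form just needs half as many multiplies and no separate add per output lane.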

> Is anything latency-bottlenecked? addps is 3/1 on intel and 5/1 on
> bulldozer, so for max throughput you'd need 3 or 5 independent adds close
> enough together for out of order execution to notice them. OOOE does apply
> across multiple loop iterations, but the lookahead might not be big enough
> to hold 5 iterations of 8_to_1. Multiple accumulators would help this at
> the cost of register pressure.

I suppose it depends on the cost of running out of registers. For
stereo, extra accumulators would basically move more of the matrix
coefficients to the stack. The mono case can certainly afford the
additional register usage, though. I can test the SSE version on Atom
and see whether using multiple accumulators helps at all.
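
To illustrate the multiple-accumulator idea, here is a minimal scalar C sketch. It is only a model of the technique, not the asm under review: the function names and signatures are hypothetical, and two accumulators are used where the quoted latencies would suggest three to five. With one accumulator every add waits on the previous one; with two, the adds form two independent chains that the CPU can overlap, at the cost of one more live register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an n-to-1 mix: dst[i] = sum over ch of
 * src[ch][i] * coef[ch]. A single accumulator makes the adds one serial
 * dependency chain, so each add waits for the previous one to finish. */
void mix_n_to_1_single(float *dst, const float *const *src,
                       const float *coef, int channels, size_t len)
{
    assert(channels > 0);
    for (size_t i = 0; i < len; i++) {
        float acc = 0.0f;
        for (int ch = 0; ch < channels; ch++)
            acc += src[ch][i] * coef[ch];
        dst[i] = acc;
    }
}

/* Same mix with two accumulators: even and odd channels are summed in
 * independent chains and combined once at the end. Note that this changes
 * the order of the float additions, so results may differ in the last bit. */
void mix_n_to_1_multi(float *dst, const float *const *src,
                      const float *coef, int channels, size_t len)
{
    assert(channels > 0);
    for (size_t i = 0; i < len; i++) {
        float acc0 = 0.0f, acc1 = 0.0f;
        int ch = 0;
        for (; ch + 2 <= channels; ch += 2) {
            acc0 += src[ch][i]     * coef[ch];
            acc1 += src[ch + 1][i] * coef[ch + 1];
        }
        if (ch < channels)              /* odd channel count: one left over */
            acc0 += src[ch][i] * coef[ch];
        dst[i] = acc0 + acc1;
    }
}
```

Whether the second form is actually faster depends on the target: it only pays off when the loop is latency-bound and the extra accumulator does not force a spill.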

-Justin
