On Tue, 24 May 2011, Daniel Kang wrote:
On Tue, May 24, 2011 at 10:43 AM, Loren Merritt <[email protected]> wrote:

Are you sure you don't want to deinline the idct part and unroll the loop
over blocks? If not, what's different about h264_idct_add16_sse2?

Different as compared to what?

The difference between h264_idct_add16_sse2 and h264_idct_add16_10_sse2 that would cause different strategies to be optimal.

I can unroll it if you prefer.

I'm not stating a preference, I'm describing a strategy that might or might not be faster. Strategy found by pattern-matching on existing code, not by abstract reasoning; I expect a large speed gain only because a comment on the existing code says so.

In a previous patch you had deinlined IDCT8. Did you decide that it's ok to
spend 2kb on this function? Or 4kb since h264_idct8_add4_10 doesn't call
h264_idct8_add_10?

That patch was for x264. Here, arguments change. I guess I could make
another function if you really prefer... It would require pushing args to
the stack or xchg's.

Likewise, the preference is for whichever is faster, combined with a prediction that icache is important.
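For readers following the thread, the two shapes under discussion can be sketched in C. This is only an illustration under stated assumptions: toy_add_block is a hypothetical stand-in (a clamped 10-bit residual add, not the real H.264 IDCT), the raster block offsets are a simplification, and the NOINLINE macro relies on a GCC/Clang attribute. The actual functions are hand-written x86 assembly, and which variant is faster there can only be settled by measurement.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang attribute; other compilers get no inlining hint. */
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* Hypothetical per-block kernel: adds a 4x4 residual to 10-bit pixels and
 * clamps to [0, 1023]. A toy stand-in, NOT the real H.264 IDCT.
 * Kept out of line so only one copy of its code exists (icache-friendly). */
static NOINLINE void toy_add_block(uint16_t *dst, const int16_t *res,
                                   ptrdiff_t stride)
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++) {
            int v = dst[y * stride + x] + res[y * 4 + x];
            dst[y * stride + x] = (uint16_t)(v < 0 ? 0 : v > 1023 ? 1023 : v);
        }
}

/* Shape A: a compact loop over the 16 blocks. */
static void toy_add16_loop(uint16_t *dst, const int *offset,
                           const int16_t *res, ptrdiff_t stride,
                           const uint8_t *nnz)
{
    for (int i = 0; i < 16; i++)
        if (nnz[i])
            toy_add_block(dst + offset[i], res + i * 16, stride);
}

/* Shape B: the loop over blocks unrolled by hand, still calling the single
 * out-of-line kernel, so block indices become compile-time constants. */
#define ADD_ONE(i) \
    do { if (nnz[i]) toy_add_block(dst + offset[i], res + (i) * 16, stride); } while (0)

static void toy_add16_unrolled(uint16_t *dst, const int *offset,
                               const int16_t *res, ptrdiff_t stride,
                               const uint8_t *nnz)
{
    ADD_ONE(0);  ADD_ONE(1);  ADD_ONE(2);  ADD_ONE(3);
    ADD_ONE(4);  ADD_ONE(5);  ADD_ONE(6);  ADD_ONE(7);
    ADD_ONE(8);  ADD_ONE(9);  ADD_ONE(10); ADD_ONE(11);
    ADD_ONE(12); ADD_ONE(13); ADD_ONE(14); ADD_ONE(15);
}

/* Returns 1 when both shapes give identical output on synthetic input. */
static int toy_strategies_agree(void)
{
    uint16_t a[16 * 16], b[16 * 16];
    int16_t res[16 * 16];
    int offset[16];
    uint8_t nnz[16];

    for (int i = 0; i < 256; i++) {
        a[i] = b[i] = (uint16_t)((i * 7) % 1024);
        res[i] = (int16_t)((i % 64) - 32);
    }
    for (int i = 0; i < 16; i++) {
        offset[i] = (i / 4) * 4 * 16 + (i % 4) * 4; /* raster order, simplified */
        nnz[i] = (uint8_t)(i % 3 != 0);             /* skip some blocks */
    }
    toy_add16_loop(a, offset, res, 16, nnz);
    toy_add16_unrolled(b, offset, res, 16, nnz);
    for (int i = 0; i < 256; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Both shapes compute the same result; they differ only in code size and call overhead, which is the trade-off the icache remark refers to. A C compiler is free to unroll or inline on its own, so this sketch shows the structure rather than predicting the generated code.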

--Loren Merritt
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
