On Tue, 24 May 2011, Daniel Kang wrote:
On Tue, May 24, 2011 at 10:43 AM, Loren Merritt <[email protected]> wrote:
Are you sure you don't want to deinline the idct part and unroll the loop
over blocks? If not, what's different about h264_idct_add16_sse2?
Different as compared to what?
Differences between h264_idct_add16_sse2 and h264_idct_add16_10_sse2 that would
cause different strategies to be optimal.
I can unroll it if you prefer.
I'm not stating a preference, I'm describing a strategy that might
or might not be faster. Strategy found by pattern-matching on existing
code, not by abstract reasoning; I expect a large speed gain only because
a comment on the existing code says so.
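To make the two strategies concrete, here is a rough C sketch, not the
actual asm. Every name in it is hypothetical. The transform is reduced to
a DC-only stand-in, stride is counted in pixels, and nnz is a plain
16-entry array rather than the real coefficient-count layout. It only
shows the call structure: the IDCT body inlined into a loop over blocks,
versus one deinlined copy called from unrolled call sites.

```c
#include <stdint.h>

/* GCC/Clang attribute; assumed here, other compilers get a no-op. */
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* Clamp to the 10-bit pixel range. */
static inline uint16_t clip10(int v)
{
    return (uint16_t)(v < 0 ? 0 : v > 1023 ? 1023 : v);
}

/* Hypothetical stand-in for one 4x4 IDCT-and-add: DC term only.
 * "inline" is just a hint; the compiler makes the final decision. */
static inline void idct_dc_add_body(uint16_t *dst, int stride,
                                    const int32_t *block)
{
    int dc = (block[0] + 32) >> 6;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride + x] = clip10(dst[y * stride + x] + dc);
}

/* Deinlined copy: the body exists once and is reached by a call. */
static NOINLINE void idct_dc_add_call(uint16_t *dst, int stride,
                                      const int32_t *block)
{
    idct_dc_add_body(dst, stride, block);
}

/* Strategy A: loop over the 16 blocks, IDCT body inlined in the loop. */
static void add16_loop(uint16_t *dst, const int *block_offset,
                       const int32_t *block, int stride,
                       const uint8_t *nnz)
{
    for (int i = 0; i < 16; i++)
        if (nnz[i])
            idct_dc_add_body(dst + block_offset[i], stride, block + i * 16);
}

/* Strategy B: loop unrolled, each site is a small test plus a call. */
#define ADD_BLOCK(i)                                                   \
    do {                                                               \
        if (nnz[i])                                                    \
            idct_dc_add_call(dst + block_offset[i], stride,            \
                             block + (i) * 16);                        \
    } while (0)

static void add16_unrolled(uint16_t *dst, const int *block_offset,
                           const int32_t *block, int stride,
                           const uint8_t *nnz)
{
    ADD_BLOCK(0);  ADD_BLOCK(1);  ADD_BLOCK(2);  ADD_BLOCK(3);
    ADD_BLOCK(4);  ADD_BLOCK(5);  ADD_BLOCK(6);  ADD_BLOCK(7);
    ADD_BLOCK(8);  ADD_BLOCK(9);  ADD_BLOCK(10); ADD_BLOCK(11);
    ADD_BLOCK(12); ADD_BLOCK(13); ADD_BLOCK(14); ADD_BLOCK(15);
}
```

Both variants compute the same result. Which one is faster depends on
code size versus call overhead, and that has to be benchmarked.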
In a previous patch you had deinlined IDCT8. Did you decide that it's ok to
spend 2kb on this function? Or 4kb since h264_idct8_add4_10 doesn't call
h264_idct8_add_10?
That patch was for x264. Here, arguments change. I guess I could make
another function if you really prefer... It would require pushing args to
the stack or xchg's.
Likewise, the preference is for whichever is faster, combined with a
prediction that icache is important.
--Loren Merritt