Hi,

On Tue, Jan 3, 2012 at 4:17 PM, Jason Garrett-Glaser <[email protected]> wrote:
>> +    add       srcq, 3 * mmsize / 2
>> +    punpcklbw m0, m7     ; (word) { B0, G0, R0, B1 }
>> +    punpcklbw m1, m7     ; (word) { R0, B1, G1, R1 }
>> +    punpcklbw m2, m7     ; (word) { B2, G2, R2, B3 }
>> +    punpcklbw m3, m7     ; (word) { R2, B3, G3, R3 }
>> +    pmaddwd   m0, coeff1 ; (dword) { B0*BY + G0*GY, B1*BY }
>> +    pmaddwd   m1, coeff2 ; (dword) { R0*RY, G1*GY + R1*RY }
>> +    pmaddwd   m2, coeff1 ; (dword) { B2*BY + G2*GY, B3*BY }
>> +    pmaddwd   m3, coeff2 ; (dword) { R2*RY, G3*GY + R3*RY }
>
> A lower-precision SSSE3-based maddubsw version might be applicable
> here for later work?
This is planned for later, behind a "-flag fast"-type option. However, before code like that is added, I'd like to ensure that we have automated tests in place to verify that the code remains correct. (This isn't hard; we already have off-by-1 or off-by-2 tests for audio/float.)

> Would this be faster with pmulhw and a different byte ordering?

I haven't tested this yet - you mean pmulhw followed by paddw x2 (or phaddw+paddw with the current byte ordering), so that we get rid of the psrad x, 15 + packssdw in the current pmaddwd+paddd+psrad+packssdw sequence? In the SSSE3 case, that would probably help, since pshufb gives us freedom in byte ordering (I expect phaddw to be slow). I don't think it would have helped much in the old case (which you reviewed), because of byte ordering limitations and the lack of an efficient horizontal add...

(Also note that this would be slightly less precise and would thus require changing the fate tests to allow off-by-one. I don't want to add new code that is untested; it's a recipe for disaster.)

Ronald

_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
