Hi,

On Tue, Jan 3, 2012 at 4:17 PM, Jason Garrett-Glaser <[email protected]> wrote:
>> +    add          srcq, 3 * mmsize / 2
>> +    punpcklbw      m0, m7                 ; (word) { B0, G0, R0, B1 }
>> +    punpcklbw      m1, m7                 ; (word) { R0, B1, G1, R1 }
>> +    punpcklbw      m2, m7                 ; (word) { B2, G2, R2, B3 }
>> +    punpcklbw      m3, m7                 ; (word) { R2, B3, G3, R3 }
>> +    pmaddwd        m0, coeff1             ; (dword) { B0*BY + G0*GY, B1*BY }
>> +    pmaddwd        m1, coeff2             ; (dword) { R0*RY, G1*GY + R1*RY }
>> +    pmaddwd        m2, coeff1             ; (dword) { B2*BY + G2*GY, B3*BY }
>> +    pmaddwd        m3, coeff2             ; (dword) { R2*RY, G3*GY + R3*RY }
>
> A lower-precision SSSE3-based maddubsw version might be applicable
> here for later work?

It's planned for later, behind some kind of fast-mode flag. However,
before code like that is added, I'd like to make sure we have
automated tests in place to ensure the code stays correct. (This
isn't hard; we already have off-by-1 and off-by-2 tests for
audio/float.)
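(The tolerance check itself is trivial; a sketch in C, with
hypothetical names, comparing SIMD output against the C reference
within an allowed absolute difference:)

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical off-by-N comparison, like the audio/float tests
 * mentioned above: return 0 if every byte of the SIMD output is
 * within `tolerance` of the C reference, -1 otherwise. */
static int compare_off_by_n(const uint8_t *ref, const uint8_t *simd,
                            int len, int tolerance)
{
    for (int i = 0; i < len; i++)
        if (abs(ref[i] - simd[i]) > tolerance)
            return -1; /* mismatch beyond allowed tolerance */
    return 0;
}
```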

> Would this be faster with pmulhw and a different byte ordering?

I haven't tested this yet - do you mean pmulhw followed by two paddws
(or phaddw+paddw with the current byte ordering), so that we get rid
of the psrad x, 15 + packssdw, instead of the current
pmaddwd+paddd+psrad+packssdw? In the SSSE3 case that would probably
help, since pshufb gives us freedom in byte ordering (I expect phaddw
to be slow). I don't think it would have helped much in the old case
(which you reviewed), because of byte ordering limitations and the
lack of an efficient horizontal add...
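(For reference, a scalar C model of the precision difference - my own
sketch, not the actual SIMD; the function names and the BT.601-ish
Q15 coefficients are just for illustration. pmaddwd keeps the full
32-bit products and rounds once at the end, while a pmulhw-style path
truncates each product to its high 16 bits before summing, so every
term can come out a bit low:)

```c
#include <stdint.h>
#include <stdlib.h>

/* pmaddwd-style: accumulate full 32-bit products, round+shift once. */
static int dot_pmaddwd(const int16_t *px, const int16_t *c)
{
    int32_t sum = px[0]*c[0] + px[1]*c[1] + px[2]*c[2];
    return (sum + (1 << 14)) >> 15; /* single rounding step */
}

/* pmulhw-style: each 16x16 product is truncated to bits [31:16]
 * before the adds, losing up to 1 lsb per term; the result comes
 * out at half scale with Q15 coefficients. */
static int dot_pmulhw(const int16_t *px, const int16_t *c)
{
    int16_t p0 = ((int32_t)px[0] * c[0]) >> 16;
    int16_t p1 = ((int32_t)px[1] * c[1]) >> 16;
    int16_t p2 = ((int32_t)px[2] * c[2]) >> 16;
    return p0 + p1 + p2; /* three truncations accumulate */
}
```

Doubling the pmulhw-style result to match scale, the two can differ
by a few lsb, which is exactly why the fate tolerance would need to
change.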

(Also note that this would be slightly less precise and would
therefore require changing the fate tests to allow off-by-one. I
don't want to add new code that is untested; that's a recipe for
disaster.)

Ronald
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
