Hi,

On Tue, Jul 24, 2012 at 9:41 PM, Ronald S. Bultje <rsbul...@gmail.com> wrote:
> Hi,
>
> On Sat, Jul 14, 2012 at 9:29 PM, Justin Ruggles
> <justin.rugg...@gmail.com> wrote:
>> ---
>>  libavresample/x86/audio_convert.asm    |   38 ++++++++++++++++++++++++++++++++
>>  libavresample/x86/audio_convert_init.c |   11 +++++++++
>>  2 files changed, 49 insertions(+), 0 deletions(-)
>>
>> diff --git a/libavresample/x86/audio_convert.asm b/libavresample/x86/audio_convert.asm
>> index 9ba7251..70519e1 100644
>> --- a/libavresample/x86/audio_convert.asm
>> +++ b/libavresample/x86/audio_convert.asm
>> @@ -734,3 +734,41 @@ CONV_FLTP_TO_FLT_6CH
>>  INIT_XMM avx
>>  CONV_FLTP_TO_FLT_6CH
>>  %endif
>> +
>> +;------------------------------------------------------------------------------
>> +; void ff_conv_s16_to_s16p_2ch(int16_t *const *dst, int16_t *src, int len,
>> +;                              int channels);
>> +;------------------------------------------------------------------------------
>> +
>> +%macro CONV_S16_TO_S16P_2CH 0
>> +cglobal conv_s16_to_s16p_2ch, 3,4,3, dst0, src, len, dst1
>> +    lea     lenq, [2*lend]
>> +    mov     dst1q, [dst0q+gprsize]
>> +    mov     dst0q, [dst0q        ]
>> +    lea     srcq, [srcq+2*lenq]
>> +    add     dst0q, lenq
>> +    add     dst1q, lenq
>> +    neg     lenq
>> +    ALIGN 16
>> +.loop:
>> +    mova    m0, [srcq+2*lenq       ]
>> +    mova    m1, [srcq+2*lenq+mmsize]
>> +    pshuflw m0, m0, q3120
>> +    pshufhw m0, m0, q3120
>> +    pshuflw m1, m1, q3120
>> +    pshufhw m1, m1, q3120
>> +    shufps  m2, m0, m1, q2020
>> +    shufps  m0, m1, q3131
>> +    mova    [dst0q+lenq], m2
>> +    mova    [dst1q+lenq], m0
>
> The more common way to do this (I believe) is to set up a mask register:
>
>     pcmpeqb  m4, m4
>     psrlw    m4, 8 ; 0x00ff
>
> Then mask/shift:
>
>     mova     m0, [srcq+2*lenq+0*mmsize]
>     mova     m1, [srcq+2*lenq+1*mmsize]
>     psrlw    m2, m0, 8
>     psrlw    m3, m1, 8
>     pand     m0, m4
>     pand     m1, m4
>     packuswb m0, m1
>     packuswb m2, m3
>     mova     [dst0q+lenq], m0
>     mova     [dst1q+lenq], m2
>
> However, that's not fewer instructions, but it may be worth checking anyway.
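A plain C emulation can help compare the patch's pshuflw/pshufhw/shufps sequence with a word-sized version of the mask/shift idea; both are checked against a scalar reference deinterleave. This is a sketch under stated assumptions, not libav code: all function names are hypothetical, the host is assumed little-endian (as on x86), and the word variant's instruction choice (pslld/psrad, then packssdw) is an illustrative assumption rather than something proposed in the thread. Sign-extending with psrad keeps every 16-bit value in range, so the signed-saturating pack leaves it unchanged.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 128-bit XMM register viewed as 8 words or 4 dwords
 * (assumes a little-endian host, as on x86). */
typedef union { int16_t w[8]; uint32_t d[4]; } xmm;

/* Scalar reference: interleaved L0 R0 L1 R1 ... -> two planes. */
static void deint_ref(int16_t *dst0, int16_t *dst1, const int16_t *src, int len)
{
    for (int i = 0; i < len; i++) {
        dst0[i] = src[2 * i];
        dst1[i] = src[2 * i + 1];
    }
}

/* pshuflw (half = 0) / pshufhw (half = 1): permute the four words of one
 * 64-bit half using 2-bit fields of imm; the other half is unchanged. */
static xmm pshufw_half(xmm a, int imm, int half)
{
    xmm r = a;
    for (int i = 0; i < 4; i++)
        r.w[4 * half + i] = a.w[4 * half + ((imm >> (2 * i)) & 3)];
    return r;
}

/* shufps: result dwords 0-1 are picked from a, dwords 2-3 from b. */
static xmm shufps(xmm a, xmm b, int imm)
{
    xmm r;
    r.d[0] = a.d[imm & 3];
    r.d[1] = a.d[(imm >> 2) & 3];
    r.d[2] = b.d[(imm >> 4) & 3];
    r.d[3] = b.d[(imm >> 6) & 3];
    return r;
}

/* The patch's loop body; len must be a multiple of 8.
 * x86inc notation: q3120 = 0xD8, q2020 = 0x88, q3131 = 0xDD. */
static void deint_patch(int16_t *dst0, int16_t *dst1, const int16_t *src, int len)
{
    for (int i = 0; i < len; i += 8) {
        xmm m0, m1, m2;
        memcpy(m0.w, src + 2 * i,     16);
        memcpy(m1.w, src + 2 * i + 8, 16);
        m0 = pshufw_half(pshufw_half(m0, 0xD8, 0), 0xD8, 1); /* L L R R L L R R */
        m1 = pshufw_half(pshufw_half(m1, 0xD8, 0), 0xD8, 1);
        m2 = shufps(m0, m1, 0x88); /* even dwords: left channel  */
        m0 = shufps(m0, m1, 0xDD); /* odd dwords:  right channel */
        memcpy(dst0 + i, m2.w, 16);
        memcpy(dst1 + i, m0.w, 16);
    }
}

/* packssdw saturation for one element. */
static int16_t sat16(int32_t v)
{
    return v > INT16_MAX ? INT16_MAX : v < INT16_MIN ? INT16_MIN : (int16_t)v;
}

/* Hypothetical word-sized mask/shift variant, one 32-bit lane (L | R << 16)
 * at a time: pslld 16 + psrad 16 sign-extends L, psrad 16 sign-extends R,
 * so packssdw never saturates. Assumes the uint32_t -> int32_t conversion
 * wraps and >> on negative values is arithmetic (true for GCC/Clang on x86). */
static void deint_mask_shift(int16_t *dst0, int16_t *dst1, const int16_t *src, int len)
{
    for (int i = 0; i < len; i++) {
        uint32_t lane;
        memcpy(&lane, src + 2 * i, 4);
        dst0[i] = sat16((int32_t)(lane << 16) >> 16);
        dst1[i] = sat16((int32_t)lane >> 16);
    }
}

/* Run both variants on 16 samples per channel and compare with the reference. */
static void self_check(void)
{
    int16_t src[32], r0[16], r1[16], t0[16], t1[16];
    for (int i = 0; i < 32; i++)
        src[i] = (int16_t)(i * 1931 - 30000); /* mix of negative and positive */
    deint_ref(r0, r1, src, 16);
    deint_patch(t0, t1, src, 16);
    assert(!memcmp(r0, t0, sizeof r0) && !memcmp(r1, t1, sizeof r1));
    deint_mask_shift(t0, t1, src, 16);
    assert(!memcmp(r0, t0, sizeof r0) && !memcmp(r1, t1, sizeof r1));
}
```

Calling self_check() runs both emulations on a mix of negative and positive samples and asserts that they match the scalar reference.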
>
> Alternatively, a pshufb version:
>
>     mova       m3, [pb_02468ace13579bdf]
> .loop:
>     mova       m0, [srcq+2*lenq+0*mmsize]
>     mova       m1, [srcq+2*lenq+1*mmsize]
>     pshufb     m0, m3
>     pshufb     m1, m3
>     punpcklqdq m2, m0, m1
>     punpckhqdq m0, m1
>     mova       [dst0q+lenq], m2
>     mova       [dst1q+lenq], m0
>
> That's two fewer instructions, and only two unpacks instead of all the
> shuffles, so it's potentially faster (except on Atom, where pshufb is
> dog-slow).
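The pshufb mask above selects individual bytes; for s16 samples it has to move whole byte pairs instead. As a hedged illustration, the C sketch below emulates pshufb plus the two qword unpacks using a word-granular mask. The constant name pb_014589cd2367abef only follows the naming pattern of pb_02468ace13579bdf and is hypothetical, as are the function names; the sketch assumes a little-endian host and len a multiple of 8.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-granular counterpart of pb_02468ace13579bdf: bytes of
 * words 0,2,4,6 (left) go to the low qword, words 1,3,5,7 (right) to the
 * high qword. */
static const uint8_t pb_014589cd2367abef[16] = {
    0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
};

/* pshufb: result byte i = a[mask[i] & 15], or 0 if bit 7 of mask[i] is set.
 * A temporary makes r == a (in-place shuffle) safe. */
static void pshufb(uint8_t r[16], const uint8_t a[16], const uint8_t mask[16])
{
    uint8_t t[16];
    for (int i = 0; i < 16; i++)
        t[i] = (mask[i] & 0x80) ? 0 : a[mask[i] & 15];
    memcpy(r, t, 16);
}

/* Emulation of the pshufb loop with the word mask. */
static void deint_pshufb(int16_t *dst0, int16_t *dst1, const int16_t *src, int len)
{
    for (int i = 0; i < len; i += 8) {
        uint8_t m0[16], m1[16];
        memcpy(m0, src + 2 * i,     16);
        memcpy(m1, src + 2 * i + 8, 16);
        pshufb(m0, m0, pb_014589cd2367abef); /* L0 L1 L2 L3 R0 R1 R2 R3 */
        pshufb(m1, m1, pb_014589cd2367abef); /* L4 L5 L6 L7 R4 R5 R6 R7 */
        memcpy(dst0 + i,     m0,     8);     /* punpcklqdq: low qwords  */
        memcpy(dst0 + i + 4, m1,     8);
        memcpy(dst1 + i,     m0 + 8, 8);     /* punpckhqdq: high qwords */
        memcpy(dst1 + i + 4, m1 + 8, 8);
    }
}

/* Check the emulation against a direct scalar deinterleave. */
static void self_check_pshufb(void)
{
    int16_t src[32], l[16], r[16];
    for (int i = 0; i < 32; i++)
        src[i] = (int16_t)(i * 1931 - 30000);
    deint_pshufb(l, r, src, 16);
    for (int i = 0; i < 16; i++)
        assert(l[i] == src[2 * i] && r[i] == src[2 * i + 1]);
}
```

Because each 16-bit sample is moved as a byte pair, the shuffle itself never touches sample values, so no sign handling is needed in this variant.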

Actually, that's all byte-based, but I guess it's obvious what I mean, so it
should be easy to convert to word-speak.

Ronald
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel