On 21/08/14 10:03 AM, Hendrik Leppkes wrote:
> On Thu, Aug 21, 2014 at 12:42 AM, James Almer <jamr...@gmail.com> wrote:
>> * Reduced xmm register count to 7 (As such they are now enabled for x86_32).
>> * Removed four movdqa (affects the sse2 version only).
>> * pxor is now used to clear m0 only once.
>>
>> ~5% faster.
>>
>> Signed-off-by: James Almer <jamr...@gmail.com>
>> ---
>
> Good job, faster and 32-bit compat!
>
>>  libavcodec/x86/hevc_res_add.asm | 122 ++++++++++++++++------------------------
>>  libavcodec/x86/hevcdsp_init.c   |  10 ++--
>>  2 files changed, 51 insertions(+), 81 deletions(-)
>>
>> diff --git a/libavcodec/x86/hevc_res_add.asm b/libavcodec/x86/hevc_res_add.asm
>> index feea50c..7238fb3 100644
>> --- a/libavcodec/x86/hevc_res_add.asm
>> +++ b/libavcodec/x86/hevc_res_add.asm
>> @@ -88,71 +88,41 @@ cglobal hevc_transform_add4_8, 3, 4, 6
>>      movhps    [r0+r3  ], m1
>>  %endmacro
>>
>> -%macro TR_ADD_INIT_SSE_8 0
>> -    pxor      m0, m0
>> -
>> -    mova      m4, [r1]
>> -    mova      m1, [r1+16]
>> -    psubw     m2, m0, m1
>> -    psubw     m5, m0, m4
>> -    packuswb  m4, m1
>> -    packuswb  m5, m2
>> -
>> -    mova      m6, [r1+32]
>> -    mova      m1, [r1+48]
>> -    psubw     m2, m0, m1
>> -    psubw     m7, m0, m6
>> -    packuswb  m6, m1
>> -    packuswb  m7, m2
>> -
>> -    mova      m8, [r1+64]
>> -    mova      m1, [r1+80]
>> -    psubw     m2, m0, m1
>> -    psubw     m9, m0, m8
>> -    packuswb  m8, m1
>> -    packuswb  m9, m2
>> -
>> -    mova      m10, [r1+96]
>> -    mova      m1, [r1+112]
>> -    psubw     m2, m0, m1
>> -    psubw     m11, m0, m10
>> -    packuswb  m10, m1
>> -    packuswb  m11, m2
>> -%endmacro
>> -
>> -
>> -%macro TR_ADD_SSE_16_8 0
>> -    TR_ADD_INIT_SSE_8
>> -
>> -    paddusb   m0, m4, [r0     ]
>> -    paddusb   m1, m6, [r0+r2  ]
>> -    paddusb   m2, m8, [r0+r2*2]
>> -    paddusb   m3, m10,[r0+r3  ]
>> -    psubusb   m0, m5
>> -    psubusb   m1, m7
>> -    psubusb   m2, m9
>> -    psubusb   m3, m11
>> -    mova      [r0     ], m0
>> -    mova      [r0+r2  ], m1
>> -    mova      [r0+2*r2], m2
>> -    mova      [r0+r3  ], m3
>> -%endmacro
>> -
>> -%macro TR_ADD_SSE_32_8 0
>> -    TR_ADD_INIT_SSE_8
>> -
>> -    paddusb   m0, m4, [r0      ]
>> -    paddusb   m1, m6, [r0+16   ]
>> -    paddusb   m2, m8, [r0+r2   ]
>> -    paddusb   m3, m10,[r0+r2+16]
>> -    psubusb   m0, m5
>> -    psubusb   m1, m7
>> -    psubusb   m2, m9
>> -    psubusb   m3, m11
>> -    mova      [r0      ], m0
>> -    mova      [r0+16   ], m1
>> -    mova      [r0+r2   ], m2
>> -    mova      [r0+r2+16], m3
>> +%macro TR_ADD_SSE_16_32_8 3
>> +    mova      m2, [r1+%1   ]
>> +    mova      m6, [r1+%1+16]
>> +%if cpuflag(avx)
>> +    psubw     m1, m0, m2
>> +    psubw     m5, m0, m6
>> +%else
>> +    mova      m1, m0
>> +    mova      m5, m0
>> +    psubw     m1, m2
>> +    psubw     m5, m6
>> +%endif
>
> I was wondering about these blocks - doesn't the x264asm layer
> automatically add the mova's when you just use the 3-arg form on sse2?
> Or is there a speed benefit grouping the mov's?
>
> - Hendrik
It does that, but on older SSE2 cpus with not-so-good OOO execution, grouping instructions like this might help reduce dependencies a bit.
This "trick" is used all over the tree (including some macros from x86util), so it's certainly useful for some cpus.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel