Hi, actual asm now:
On Sat, Jan 5, 2013 at 9:01 AM, Daniel Kang <[email protected]> wrote:

> +%macro CHECK 2
> +    movu      m2, [curq+mrefsq+%1]
> +    movu      m3, [curq+prefsq+%2]
> +    mova      m4, m2
> +    mova      m5, m2
> +    pxor      m4, m3
> +    pavgb     m5, m3

pxor  m4, m2, m3
pavgb m5, m2, m3

> +    mova      m4, m2
> +    psubusb   m2, m3
> +    psubusb   m3, m4
> +    pmaxub    m2, m3

psubusb m4, m3, m2
psubusb m2, m3
pmaxub  m2, m4

> +    mova      m3, m2
> +    mova      m4, m2
> +%if mmsize == 16
> +    psrldq    m3, 1
> +    psrldq    m4, 2
> +%else
> +    psrlq     m3, 8
> +    psrlq     m4, 16
> +%endif

psrldq m3, m2, 1
psrldq m4, m2, 2

Same for the other half.

> +%macro CHECK1 0
> +    mova    m3, m0
> +    pcmpgtw m3, m2

pcmpgtw m3, m0, m2

> +    mova    m6, m3
> +    pand    m5, m3
> +    pandn   m3, m1

I suppose m6 is a backup for elsewhere, so this can probably be
re-arranged for AVX also.

> +    mova    m1, m3
> +%endmacro

Same.

> +%macro CHECK2 0
> +    paddw   m6, [pw_1]
> +    psllw   m6, 14
> +    paddsw  m2, m6
> +    mova    m3, m0
> +    pcmpgtw m3, m2

pcmpgtw m3, m0, m2

> +    pminsw  m0, m2
> +    pand    m5, m3
> +    pandn   m3, m1
> +    por     m3, m5
> +    mova    m1, m3
> +%endmacro

Again AVX.

> +%macro FILTER 3
> +.loop%1:
> +    pxor         m7, m7
> +    LOAD          0, [curq+mrefsq]
> +    LOAD          1, [curq+prefsq]
> +    LOAD          2, [%2]
> +    LOAD          3, [%3]
> +    mova         m4, m3
> +    paddw        m3, m2
> +    psraw        m3, 1
> +    mova   [rsp+ 0], m0
> +    mova   [rsp+16], m3
> +    mova   [rsp+32], m1

Use m8-15 on x86-64.

> +    LOAD          3, [prevq+mrefsq]
> +    LOAD          4, [prevq+prefsq]

[..]

> +    LOAD          3, [nextq+mrefsq]
> +    LOAD          4, [nextq+prefsq]

[..]

> +    movu         m2, [curq+mrefsq-1]
> +    movu         m3, [curq+prefsq-1]

[..]

> +    LOAD          2, [%2+mrefsq*2]
> +    LOAD          4, [%3+mrefsq*2]
> +    LOAD          3, [%2+prefsq*2]
> +    LOAD          5, [%3+prefsq*2]

[..]

> +    add        dstq, mmsize/2
> +    add       prevq, mmsize/2
> +    add        curq, mmsize/2
> +    add       nextq, mmsize/2
> +    sub          wd, mmsize/2
> +    cmp          wd, 0
> +    jg .loop%1

So, it seems to me mrefsq/prefsq don't change; can we increment them
before the filter and then use [dstq] instead?
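For readers less familiar with the SIMD idioms in CHECK: the
psubusb/psubusb/pmaxub sequence computes a per-byte absolute difference
(unsigned saturating subtraction clamps the "wrong-way" result to zero, and
the max picks the surviving one), and pavgb is a rounding byte average. A
scalar C model of both, purely illustrative (the function names are mine,
not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Model of psubusb: unsigned subtraction saturating at 0. */
uint8_t subus_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Model of the psubusb/psubusb/pmaxub trick: |a - b| per byte.
 * One of the two saturating differences is 0, the other is |a - b|,
 * so pmaxub recovers the absolute difference. */
uint8_t absdiff_u8(uint8_t a, uint8_t b)
{
    uint8_t d1 = subus_u8(a, b);
    uint8_t d2 = subus_u8(b, a);
    return d1 > d2 ? d1 : d2;
}

/* Model of pavgb: rounding average, (a + b + 1) >> 1 without overflow
 * because the sum is widened to int first. */
uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}
```

The same trick is why the three-operand AVX form above can drop the mova
backups: the inputs survive into the non-destructive destination.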
If you do that, we can sub dstq, wq (after sign-extending wd->wq), neg wq,
index [dstq] with [dstq+wq] in the main loop, and then the last 7 lines can
simply be replaced by:

add wq, mmsize/2
jl .loop%1

(i.e. you lose all the adds; you can lose the cmp regardless, it is implied
by the sub.)

Ronald

_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
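For reference, the loop rewrite suggested above is the standard
negative-index idiom: offset each base pointer by the width once, negate
the counter, and let a single add set the flags for the loop branch. A C
sketch of the idea (names and the byte-copy body are mine, just to show
the addressing pattern):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Negative-index loop: instead of bumping four pointers and keeping a
 * separate width counter with its own cmp, advance the pointers past
 * the row once, count a negative offset up toward zero, and branch on
 * the sign of the counter ("add wq, step; jl .loop" in the asm). */
void copy_row_negidx(uint8_t *dst, const uint8_t *src, ptrdiff_t w)
{
    dst += w;                 /* like: add dstq, wq  (w sign-extended) */
    src += w;
    ptrdiff_t i = -w;         /* like: neg wq */
    do {
        dst[i] = src[i];      /* [dstq+wq] addressing */
        i += 1;               /* the asm steps by mmsize/2 per iteration */
    } while (i < 0);          /* jl: condition implied by the add */
}
```

The per-iteration cost drops from four adds, a sub, and a cmp to one add
whose flags drive the branch directly.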
