Hi,

On Mon, Jul 4, 2011 at 10:09 AM, Daniel Kang <[email protected]> wrote:
> +INIT_MMX
> +PRED8x8_DC mmxext
> +PRED8x8_DC sse2

You seem to do this "keep in mmx regs for sse2" for all DCs. Why? It
seems you can save several instructions, plus the overhead of moving
from mmx to xmm, by doing everything in xmm registers...

> +    lea         r2, [r1+r1*2]

If you do this one a little more up (where you do the vertical DC),
you can use it instead of having to do lea r0, [r0+r1*2].

+cglobal pred8x8_dc_10_%1, 2,4
[..]
+    mov         r0, r4

That seems wrong, you're declaring to use 4 registers, but use r4
also. I think on x86-64, you can use r10/r11, and on x86-32, you can
prevent the mov and just restore the value from r0m, if you want.

> +INIT_MMX
> +PRED8x8_TOP_DC mmxext
> +PRED8x8_TOP_DC sse2

Same here as above (mmx->xmm moves).

> +;-----------------------------------------------------------------------------
> +;void pred8x8l_dc(pixel *src, int has_topleft, int has_topright, int stride)
> +;-----------------------------------------------------------------------------
> +%macro PRED8x8L_DC 1
> +cglobal pred8x8l_dc_10_%1, 4,5,8
> +    sub         r0, r3
> +    lea         r4, [r0+r3*2]
> +    mova        m0, [r0+r3*1-16]
> +    punpckhwd   m0, [r0+r3*0-16]

When I measured, SIMD-vertical-DC was never faster than doing this
part in scalar, as you do in horizontal. Did you measure this and
compare it to a scalar implementation for vertical-DC?

> +%if mmsize==16
> +    mova  m0, [r0+ 0]
> +    mova  m1, [r0+16]
> +%else
> +    movq  m0, [r0+ 0]
> +    movq  m1, [r0+ 8]
> +    movq  m2, [r0+16]
> +    movq  m3, [r0+24]
> +%endif

mova m0, [r0+0]
mova m1, [r0+mmsize]
%if mmsize==8
mova m2, [r0+16]
mova m3, [r0+24]
%endif

> +;-----------------------------------------------------------------------------
> +; void pred16x16_horizontal(pixel *src, int stride)
> +;-----------------------------------------------------------------------------
> +%macro PRED16x16_HORIZONTAL 1
> +cglobal pred16x16_horizontal_10_%1, 2,3
> +    sub    r0, r1
[..]
> +    movd   m0, [r0+r1*1-4]
> +    movd   m1, [r0+r1*2-4]

Why the sub?

Ronald
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel

Reply via email to