On Tue, May 24, 2011 at 10:43 AM, Loren Merritt <[email protected]> wrote:

> +%macro IDCT_ADD16_10 1
>> +cglobal h264_idct_add16_10_%1, 5,7
>> +    xor          r5, r5
>> +%ifdef PIC
>> +    lea         r11, [scan8_mem]
>> +%endif
>> +.nextblock
>> +    movzx        r6, byte [scan8+r5]
>> +    movzx        r6, byte [r4+r6]
>> +    test         r6, r6
>>
>
> cmp byte [r4+r6], 0
>
Done


>  +    jz .skipblock
>> +    mov         r6d, dword [r1+r5*4]
>> +    lea          r6, [r0+r6]
>>
>
> add r6, r0
>
Done


>  +    IDCT4_ADD_10 r6, r2, r3
>> +.skipblock
>> +    inc          r5
>> +    add          r2, 64
>> +    cmp          r5, 16
>> +    jl .nextblock
>> +    REP_RET
>> +%endmacro
>>
>
> Are you sure you don't want to deinline the idct part and unroll the loop
> over blocks? If not, what's different about h264_idct_add16_sse2?
>
Different compared to what?
I can unroll it if you prefer.


>  +%macro IDCT_ADD16INTRA_10 1
>> +cglobal h264_idct_add16intra_10_%1,5,7
>> +    xor          r5, r5
>> +%ifdef PIC
>> +    lea         r11, [scan8_mem]
>> +%endif
>> +.nextblock
>> +    movzx        r6, byte [scan8+r5]
>> +    movzx        r6, byte [r4+r6]
>> +    or          r6d, dword [r2]
>> +    test         r6, r6
>>
>
> or already sets flags.
> Check dc-only, or is that rarer in 10bit?
>
I only have 7 short samples, so I can't say.
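
For reference, a rough C sketch of the dc-only check being asked about. It assumes the layout visible in the asm above (16 32-bit coefficients per 4x4 block, so 64 bytes per block, and a per-block nonzero count looked up through scan8). The function name, the enum, and the choice to return a decision instead of calling an IDCT are made up for illustration and are not part of the patch.

```c
#include <stdint.h>

/* Hypothetical labels for the three paths a 4x4 block can take. */
enum blk_path { BLK_SKIP, BLK_DC_ONLY, BLK_FULL };

/* Decide how block i (0..15) should be handled.
 * coeffs: 16 blocks of 16 coefficients each (32-bit, as in the 10-bit asm)
 * nnzc:   nonzero-coefficient counts, indexed through scan8
 * scan8:  block index -> nnzc index table (contents supplied by the caller) */
static enum blk_path classify_block(const int32_t *coeffs,
                                    const uint8_t *nnzc,
                                    const uint8_t *scan8, int i)
{
    if (nnzc[scan8[i]])
        return BLK_FULL;      /* some AC coefficients: full 4x4 IDCT */
    if (coeffs[i * 16])
        return BLK_DC_ONLY;   /* only the DC term is set: cheap path */
    return BLK_SKIP;          /* nothing to add */
}
```

The quoted asm folds the two tests into one (or r6d, dword [r2]) and always takes the full IDCT. Whether splitting out the dc-only case pays off depends on how often it occurs in 10-bit streams, which is the open question here.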


>  +cglobal h264_idct_dc_add_10_mmx2,3,3
>> +    mov      r1d, dword [r1]
>> +    add       r1, 32
>> +    sar       r1, 6
>> +    movd      m0, r1d
>>
>
> I would expect that to be faster in mmx, even though no simd is possible.
> Especially on amd, where movd mm,r32 is slow.
>
Done, but I have no AMD box to test on.
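
For anyone following along, the scalar part of the dc add is just the rounding shift from the asm above followed by a clipped add. A minimal C sketch, assuming 16-bit pixel storage, a 10-bit clip range, and a stride given in bytes; the names clip10 and idct_dc_add_10_sketch are hypothetical and this is not the code in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the 10-bit pixel range [0, 1023]. */
static uint16_t clip10(int v)
{
    return (uint16_t)(v < 0 ? 0 : v > 1023 ? 1023 : v);
}

/* Add a DC-only residual to a 4x4 block of 10-bit pixels.
 * stride is in bytes (assumed here to match the asm convention). */
static void idct_dc_add_10_sketch(uint16_t *dst, const int32_t *block,
                                  ptrdiff_t stride)
{
    assert(dst && block);
    /* Same rounding as "add r1, 32 / sar r1, 6"; relies on >> being an
     * arithmetic shift for negative values, as on mainstream compilers. */
    int dc = (block[0] + 32) >> 6;
    ptrdiff_t stride_px = stride / (ptrdiff_t)sizeof(*dst);

    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride_px + x] = clip10(dst[y * stride_px + x] + dc);
}
```

Only the 16 clipped adds benefit from SIMD; the dc computation itself is scalar either way, which is the part the comment about movd is concerned with.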


>  +cglobal h264_idct8_add_10_%1, 3,4,8
>> +    %assign pad 256+16-gprsize-(stack_offset&15)
>> +    SUB         rsp, pad
>> +
>> +    add   dword [r1], 32
>> +    IDCT8_ADD_SSE_START r1   , rsp
>> +    IDCT8_ADD_SSE_START r1+16, rsp+128
>> +    lea          r3, [r0+8]
>> +    IDCT8_ADD_SSE_END   r0  , rsp,    r2
>> +    IDCT8_ADD_SSE_END   r3  , rsp+16, r2
>>
>
> In a previous patch you had deinlined IDCT8. Did you decide that it's ok to
> spend 2kb on this function? Or 4kb since h264_idct8_add4_10 doesn't call
> h264_idct8_add_10?
>

That patch was for x264. Here, the arguments are different. I guess I could
make another function if you really prefer... It would require pushing args
to the stack or using xchg's.