Not a super-thorough review by any means, but anyway...

On Sun, Sep 18, 2016 at 7:35 PM, Alexandra Hájková
<alexandra.khirn...@gmail.com> wrote:
[...]
> +SECTION_RODATA

Check if any of the constants are duplicates of already existing ones.

[...]
> +%macro TR_4x4 2
> +    ; interleaves src0 with src2 to m0
> +    ; and src1 with scr3 to m2
> +    ; src0: 00 01 02 03     m0: 00 02 01 21 02 22 03 23

                                  00 _20_ 01 21 02 22 03 23

[...]
> +    SWAP 3, 0
> +    SWAP 3, 2

SWAP 3, 2, 0

[...]
> +cglobal hevc_idct_4x4_ %+ %1, 1, 14, 14, coeffs

I'm pretty sure this function doesn't require 14 GPRs and 14 vector
registers.

[...]
> +%macro STORE_16 5
> +    movu [rsp + %1], %5
> +    movu [rsp + %2], %3
> +%endmacro

I don't see any reason for doing unaligned stores. This will likely
result in a performance hit compared to using aligned ones.

[...]
> +%macro E8_O8 8
> +    pmaddwd m6, m4, %3
> +    pmaddwd m7, m5, %4
> +    paddd   m6, m7
> +
> +%if %8 == 8
> +    paddd   %7, m8
> +%endif
> +
> +    paddd   m7, m6, %7 ; o8 + e8
> +    psubd   %7, m6     ; e8 - o8
> +    STORE_%8 %5 + %1, %6 + %1, %7, %2, m7
> +%endmacro

If you do the middle paddd inside the TR_4x4 macro instead (it even
takes a parameter for this already) you need to do 8 fewer adds in the
8x8 idct. You also save a register, which means the function will work
on 32-bit x86 as well.

[...]
> +; transpose src packed in m4, m5
> +; to m3, m1
> +%macro TRANSPOSE 0
> +    SBUTTERFLY wd, 4, 5, 8
> +    SBUTTERFLY dq, 4, 5, 8
> +%endmacro

"TRANSPOSE" is kind of generic; a bit more specific macro name would be
useful. The comment is also wrong. Furthermore, in simple macros like
this it's IMO preferable to have the registers as arguments instead of
being hard-coded.

[...]
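To make the argument-passing suggestion concrete, something along these
lines could work (macro name and argument order are just a sketch,
untested):

    ; transpose the 4x4 word matrix packed in %1/%2, %3 is scratch
    %macro TRANSPOSE_4x4W 3
        SBUTTERFLY wd, %1, %2, %3
        SBUTTERFLY dq, %1, %2, %3
    %endmacro

Then the callers pick the registers, e.g. TRANSPOSE_4x4W 4, 5, 8 in one
place and TRANSPOSE_4x4W 6, 7, 4 in another, instead of SWAPping values
into hard-coded registers first.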
> +%macro SWAP_BLOCKS 5
> +    ; M_i
> +    LOAD_BLOCK m6, m7, %2, %2 + %3, %2 + 2 * %3, %2 + 3 * %3, %1
> +
> +    ; M_j
> +    LOAD_BLOCK m4, m5, %4, %4 + %3, %4 + 2 * %3, %4 + 3 * %3, %5
> +    TRANSPOSE
> +    STORE_PACKED m4, m5, %2, %2 + %3, %2 + 2 * %3, %2 + 3 * %3, %1
> +
> +    ; transpose and store M_i
> +    SWAP m6, m4
> +    SWAP m7, m5
> +    TRANSPOSE
> +    STORE_PACKED m4, m5, %4, %4 + %3, %4 + 2 * %3, %4 + 3 * %3, %5
> +%endmacro

You should perform the loads in the same order as you operate on them.
E.g.

    LOAD      m4, m5
    TRANSPOSE m4, m5, m6
    LOAD      m6, m7
    STORE     m4, m5
    TRANSPOSE m6, m7, m4
    STORE     m6, m7

[...]
> +cglobal hevc_idct_8x8_ %+ %1, 1, 14, 14, coeffs

I'm pretty sure this function doesn't require 14 GPRs and 14 vector
registers either.

[...]
> +; %1, 2 - transform constants
> +; %3, 4 - regs with interleaved coeffs
> +%macro ADD 4
> +    pmaddwd m8, %3, %1
> +    pmaddwd m9, %4, %2
> +    paddd   m8, m9
> +    paddd   m10, m8
> +%endmacro

ADD is already defined in x86inc, which could potentially cause weird
issues; use a different macro name.

[...]
> +; %1 ... %4 transform coeffs
> +; %5, %6 offsets for storing e+o/e-o back to coeffsq
> +; %7 - shift
> +; %8 - add
> +; %9 - block_size
> +%macro E16_O16 9
> +    pxor m10, m10
> +    ADD %1, %2, m0, m1
> +    ADD %3, %4, m2, m3
> +
> +    movu m4, [rsp + %5]
> +%if %9 == 8
> +    paddd m4, %8
> +%endif
> +
> +    paddd m5, m10, m4 ; o16 + e16
> +    psubd m4, m10     ; e16 - o16
> +    STORE_%9 %5, %6, m4, %7, m5
> +%endmacro

Zeroing the accumulator can be avoided, for example by making the ADD
macro take an additional parameter which switches between paddd and
SWAP. Also, it doesn't seem like you're using that many registers here,
so try to use a lower register number instead of m10.

Align your offsets so you don't have to do unaligned loads/stores.
Looking at the disassembly, 50% of your loads/stores are misaligned by
8 for no obvious reason. Furthermore, the distance between the 16-byte
stores seems to be 32 bytes, which means half of the stack space is
sitting unused.
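A rough sketch of that paddd/SWAP switch, with the macro renamed as
suggested (name hypothetical, untested):

    ; %1 - 0 to initialize the m6 accumulator, nonzero to accumulate
    ; %2, %3 - transform constants
    ; %4, %5 - regs with interleaved coeffs
    %macro MADD_ACC 5
        pmaddwd m8, %4, %2
        pmaddwd m9, %5, %3
        paddd   m8, m9
    %if %1
        paddd   m6, m8      ; accumulate into m6
    %else
        SWAP    6, 8        ; first pair: the sum becomes the accumulator
    %endif
    %endmacro

That gets rid of both the pxor and the m10 dependency in E16_O16.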
You're storing 8 registers' worth of data to the stack before loading
it back again. If you avoid using more than 8 xmm registers you could
use the remaining 8 for temporary storage instead of the stack on
x86-64 (keep using the stack on x86-32).

[...]
> +%macro TR_16x4 9
> +    mova m12, [pd_64]
> +
> +    ; produce 8x4 matrix of e16 coeffs
> +    ; for 4 first rows and store it on stack (128 bytes)
> +    TR_8x4 %1, 7, %4, %5, %6, %8
> +
> +    ; load 8 even rows
> +    LOAD_BLOCK m0, m1, %9 * %6, %9 * 3 * %6, %9 * 5 * %6, %9 * 7 * %6, %1
> +    LOAD_BLOCK m2, m3, %9 * 9 * %6, %9 * 11 * %6, %9 * 13 * %6, %9 * 15 * %6, %1
> +
> +    SBUTTERFLY wd, 0, 1, 4
> +    SBUTTERFLY wd, 2, 3, 4
> +
> +    mova m7, %3
> +
> +    E16_O16 [pw_90_87], [pw_80_70], [pw_57_43], [pw_25_9], 0 + %1, 15 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_87_57], [pw_9_m43], [pw_m80_m90], [pw_m70_m25], %6 + %1, 14 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_80_9], [pw_m70_m87], [pw_m25_57], [pw_90_43], 2 * %6 + %1, 13 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_70_m43], [pw_m87_9], [pw_90_25], [pw_m80_m57], 3 * %6 + %1, 12 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_57_m80], [pw_m25_90], [pw_m9_m87], [pw_43_70], 4 * %6 + %1, 11 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_43_m90], [pw_57_25], [pw_m87_70], [pw_9_m80], 5 * %6 + %1, 10 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_25_m70], [pw_90_m80], [pw_43_9], [pw_m57_87], 6 * %6 + %1, 9 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_9_m25], [pw_43_m57], [pw_70_m80], [pw_87_m90], 7 * %6 + %1, 8 * %6 + %1, %2, m7, %7
> +%endmacro

m12 doesn't seem to be used anywhere?

[...]
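The register-instead-of-stack storage mentioned above could look
roughly like this, using x86inc's ARCH_X86_64 define (the offset is
hypothetical, untested):

    %if ARCH_X86_64
        SWAP 8, 0              ; park the finished row in a high xmm reg
    %else
        mova [rsp + 0], m0     ; no xmm8-15 on x86-32, spill to the stack
    %endif

with the matching "load" then being either a SWAP back or a mova from
the stack.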
> +%macro IDCT_16x16 1
> +cglobal hevc_idct_16x16_ %+ %1, 1, 1, 15, 1024, coeffs
> +    TR_16x4 0, 7, [pd_64], 64, 2, 32, 8, 16, 1
> +    TR_16x4 8, 7, [pd_64], 64, 2, 32, 8, 16, 1
> +    TR_16x4 16, 7, [pd_64], 64, 2, 32, 8, 16, 1
> +    TR_16x4 24, 7, [pd_64], 64, 2, 32, 8, 16, 1
> +    TRANSPOSE_16x16
> +
> +    DEFINE_BIAS %1
> +    TR_16x4 0, shift, [arr_add], 64, 2, 32, 8, 16, 1
> +    TR_16x4 8, shift, [arr_add], 64, 2, 32, 8, 16, 1
> +    TR_16x4 16, shift, [arr_add], 64, 2, 32, 8, 16, 1
> +    TR_16x4 24, shift, [arr_add], 64, 2, 32, 8, 16, 1
> +    TRANSPOSE_16x16
> +
> +    RET
> +%endmacro

You're not using 15 vector registers, nor 1024 bytes of stack space.
128 bytes are actually used, but they're spread out over 264 bytes of
stack space.

Being able to do stuff in a loop would be very helpful because this
function is 13 kB (on 64-bit Linux) and the L1i cache is going to be
sad otherwise.

[...]
> +cglobal hevc_idct_32x32_ %+ %1, 1, 1, 15, 4096, coeffs

Basically most of the things I mentioned for 16x16 hold true for 32x32
as well, with the exception that you won't be able to avoid using the
stack. The code size of this function is absolutely insane at a
whopping 75 kB. That's more than twice as large as the entire L1i
cache on most CPUs! Being able to loop this is almost a requirement.

[...]
> +INIT_XMM avx
> +IDCT_4x4 8
> +IDCT_8x8 8
> +IDCT_16x16 8
> +IDCT_32x32 8

I don't think any fancy instructions are being used, so might as well
instantiate SSE2 versions as well.

_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel