Not a super-thorough review by any means, but anyway...

On Sun, Sep 18, 2016 at 7:35 PM, Alexandra Hájková
<alexandra.khirn...@gmail.com> wrote:
[...]
> +SECTION_RODATA

Check if any of the constants are duplicates of already existing ones.

[...]
> +%macro TR_4x4 2
> +    ; interleaves src0 with src2 to m0
> +    ; and src1 with scr3 to m2
> +    ; src0: 00 01 02 03     m0: 00 02 01 21 02 22 03 23

                                  00 _20_ 01 21 02 22 03 23

[...]
> +    SWAP 3, 0
> +    SWAP 3, 2

SWAP 3, 2, 0

[...]
> +cglobal hevc_idct_4x4_ %+ %1, 1, 14, 14, coeffs

I'm pretty sure this function doesn't require 14 GPRs and 14 vector
registers.

[...]
> +%macro STORE_16 5
> +    movu [rsp + %1], %5
> +    movu [rsp + %2], %3
> +%endmacro

I don't see any reason for doing unaligned stores. This will likely
result in a performance hit compared to using aligned ones.

[...]
> +%macro E8_O8 8
> +    pmaddwd m6, m4, %3
> +    pmaddwd m7, m5, %4
> +    paddd   m6, m7
> +
> +%if %8 == 8
> +    paddd   %7, m8
> +%endif
> +
> +    paddd   m7, m6, %7 ; o8 + e8
> +    psubd   %7, m6     ; e8 - o8
> +    STORE_%8 %5 + %1, %6 + %1, %7, %2, m7
> +%endmacro

If you do the middle paddd inside the TR_4x4 macro instead (it even
takes a parameter for this already) you need to do 8 fewer adds in the
8x8 idct. You also save a register, which means the function will work
on 32-bit x86 as well.

[...]
> +; transpose src packed in m4, m5
> +; to m3, m1
> +%macro TRANSPOSE 0
> +    SBUTTERFLY wd, 4, 5, 8
> +    SBUTTERFLY dq, 4, 5, 8
> +%endmacro

"TRANSPOSE" is kind of generic; a bit more specific macro name would be
useful. The comment is also wrong. Furthermore, in simple macros like
this it's IMO preferable to have the registers as arguments instead of
being hard-coded.

[...]
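To make the argument-passing suggestion concrete, something along these
lines could work (macro name and argument order are just a sketch,
untested):

    ; transpose the 4x4 word matrix packed in %1/%2, %3 is scratch
    %macro TRANSPOSE_4x4W 3
        SBUTTERFLY wd, %1, %2, %3
        SBUTTERFLY dq, %1, %2, %3
    %endmacro

Then the callers pick the registers, e.g. TRANSPOSE_4x4W 4, 5, 8 in one
place and TRANSPOSE_4x4W 6, 7, 4 in another, instead of SWAPping values
into hard-coded registers first.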
> +%macro SWAP_BLOCKS 5
> +    ; M_i
> +    LOAD_BLOCK m6, m7, %2, %2 + %3, %2 + 2 * %3, %2 + 3 * %3, %1
> +
> +    ; M_j
> +    LOAD_BLOCK m4, m5, %4, %4 + %3, %4 + 2 * %3, %4 + 3 * %3, %5
> +    TRANSPOSE
> +    STORE_PACKED m4, m5, %2, %2 + %3, %2 + 2 * %3, %2 + 3 * %3, %1
> +
> +    ; transpose and store M_i
> +    SWAP m6, m4
> +    SWAP m7, m5
> +    TRANSPOSE
> +    STORE_PACKED m4, m5, %4, %4 + %3, %4 + 2 * %3, %4 + 3 * %3, %5
> +%endmacro

You should perform the loads in the same order as you operate on them.
E.g.

    LOAD      m4, m5
    TRANSPOSE m4, m5, m6
    LOAD      m6, m7
    STORE     m4, m5
    TRANSPOSE m6, m7, m4
    STORE     m6, m7

[...]
> +cglobal hevc_idct_8x8_ %+ %1, 1, 14, 14, coeffs

I'm pretty sure this function doesn't require 14 GPRs and 14 vector
registers either.

[...]
> +; %1, 2 - transform constants
> +; %3, 4 - regs with interleaved coeffs
> +%macro ADD 4
> +    pmaddwd m8, %3, %1
> +    pmaddwd m9, %4, %2
> +    paddd   m8, m9
> +    paddd   m10, m8
> +%endmacro

ADD is already defined in x86inc, which could potentially cause weird
issues; use a different macro name.

[...]
> +; %1 ... %4 transform coeffs
> +; %5, %6 offsets for storing e+o/e-o back to coeffsq
> +; %7 - shift
> +; %8 - add
> +; %9 - block_size
> +%macro E16_O16 9
> +    pxor m10, m10
> +    ADD %1, %2, m0, m1
> +    ADD %3, %4, m2, m3
> +
> +    movu m4, [rsp + %5]
> +%if %9 == 8
> +    paddd m4, %8
> +%endif
> +
> +    paddd m5, m10, m4 ; o16 + e16
> +    psubd m4, m10     ; e16 - o16
> +    STORE_%9 %5, %6, m4, %7, m5
> +%endmacro

Zeroing the accumulator can be avoided, for example by making the ADD
macro take an additional parameter which switches between paddd and
SWAP. Also, it doesn't seem like you're using that many registers here,
so try to use a lower register number instead of m10.

Align your offsets so you don't have to do unaligned loads/stores.
Looking at the disassembly, 50% of your loads/stores are misaligned by
8 for no obvious reason. Furthermore, the distance between the 16-byte
stores seems to be 32 bytes, which means half of the stack space is
sitting unused.
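A rough sketch of that paddd/SWAP switch, with the macro renamed as
suggested (name hypothetical, untested):

    ; %1 - 0 to initialize the m6 accumulator, nonzero to accumulate
    ; %2, %3 - transform constants
    ; %4, %5 - regs with interleaved coeffs
    %macro MADD_ACC 5
        pmaddwd m8, %4, %2
        pmaddwd m9, %5, %3
        paddd   m8, m9
    %if %1
        paddd   m6, m8      ; accumulate into m6
    %else
        SWAP    6, 8        ; first pair: the sum becomes the accumulator
    %endif
    %endmacro

That gets rid of both the pxor and the m10 dependency in E16_O16.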
You're storing 8 registers' worth of data to the stack before loading
it back again. If you avoid using more than 8 xmm registers you could
use the remaining 8 for temporary storage instead of the stack on
x86-64 (keep using the stack on x86-32).

[...]
> +%macro TR_16x4 9
> +    mova m12, [pd_64]
> +
> +    ; produce 8x4 matrix of e16 coeffs
> +    ; for 4 first rows and store it on stack (128 bytes)
> +    TR_8x4 %1, 7, %4, %5, %6, %8
> +
> +    ; load 8 even rows
> +    LOAD_BLOCK m0, m1, %9 * %6, %9 * 3 * %6, %9 * 5 * %6, %9 * 7 * %6, %1
> +    LOAD_BLOCK m2, m3, %9 * 9 * %6, %9 * 11 * %6, %9 * 13 * %6, %9 * 15 * %6, %1
> +
> +    SBUTTERFLY wd, 0, 1, 4
> +    SBUTTERFLY wd, 2, 3, 4
> +
> +    mova m7, %3
> +
> +    E16_O16 [pw_90_87], [pw_80_70], [pw_57_43], [pw_25_9], 0 + %1, 15 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_87_57], [pw_9_m43], [pw_m80_m90], [pw_m70_m25], %6 + %1, 14 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_80_9], [pw_m70_m87], [pw_m25_57], [pw_90_43], 2 * %6 + %1, 13 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_70_m43], [pw_m87_9], [pw_90_25], [pw_m80_m57], 3 * %6 + %1, 12 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_57_m80], [pw_m25_90], [pw_m9_m87], [pw_43_70], 4 * %6 + %1, 11 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_43_m90], [pw_57_25], [pw_m87_70], [pw_9_m80], 5 * %6 + %1, 10 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_25_m70], [pw_90_m80], [pw_43_9], [pw_m57_87], 6 * %6 + %1, 9 * %6 + %1, %2, m7, %7
> +    E16_O16 [pw_9_m25], [pw_43_m57], [pw_70_m80], [pw_87_m90], 7 * %6 + %1, 8 * %6 + %1, %2, m7, %7
> +%endmacro

m12 doesn't seem to be used anywhere?

[...]
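The register-instead-of-stack storage mentioned above could look
roughly like this, using x86inc's ARCH_X86_64 define (the offset is
hypothetical, untested):

    %if ARCH_X86_64
        SWAP 8, 0              ; park the finished row in a high xmm reg
    %else
        mova [rsp + 0], m0     ; no xmm8-15 on x86-32, spill to the stack
    %endif

with the matching "load" then being either a SWAP back or a mova from
the stack.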
> +%macro IDCT_16x16 1
> +cglobal hevc_idct_16x16_ %+ %1, 1, 1, 15, 1024, coeffs
> +    TR_16x4 0, 7, [pd_64], 64, 2, 32, 8, 16, 1
> +    TR_16x4 8, 7, [pd_64], 64, 2, 32, 8, 16, 1
> +    TR_16x4 16, 7, [pd_64], 64, 2, 32, 8, 16, 1
> +    TR_16x4 24, 7, [pd_64], 64, 2, 32, 8, 16, 1
> +    TRANSPOSE_16x16
> +
> +    DEFINE_BIAS %1
> +    TR_16x4 0, shift, [arr_add], 64, 2, 32, 8, 16, 1
> +    TR_16x4 8, shift, [arr_add], 64, 2, 32, 8, 16, 1
> +    TR_16x4 16, shift, [arr_add], 64, 2, 32, 8, 16, 1
> +    TR_16x4 24, shift, [arr_add], 64, 2, 32, 8, 16, 1
> +    TRANSPOSE_16x16
> +
> +    RET
> +%endmacro

You're not using 15 vector registers, nor 1024 bytes of stack space.
128 bytes are actually used, but they're spread out over 264 bytes of
stack space.

Being able to do stuff in a loop would be very helpful because this
function is 13 kB (on 64-bit Linux) and the L1i cache is going to be
sad otherwise.

[...]
> +cglobal hevc_idct_32x32_ %+ %1, 1, 1, 15, 4096, coeffs

Basically most of the things I mentioned for 16x16 hold true for 32x32
as well, with the exception that you won't be able to avoid using the
stack. The code size of this function is absolutely insane at a
whopping 75 kB. That's more than twice as large as the entire L1i
cache on most CPUs! Being able to loop this is almost a requirement.

[...]
> +INIT_XMM avx
> +IDCT_4x4 8
> +IDCT_8x8 8
> +IDCT_16x16 8
> +IDCT_32x32 8

I don't think any fancy instructions are being used, so might as well
instantiate SSE2 versions as well.

_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel