On 2019-03-01 18:41, Michael Stoner wrote: > The AVX2 code leverages VPERMD to process 12 pixels/iteration. This is my > first patch submission so any comments are greatly appreciated. > > -Mike > > Tested on Skylake (Win32 & Win64) > 1920x1080 input frame > ===================== > C code - 440 fps > SSSE3 - 920 fps > AVX - 930 fps > AVX2 - 1040 fps > > Regression tested at 1920x1080, 1280x720, and 352x288
> .loop: > %ifidn %1, unaligned > - movu m0, [r0] > + movu m0, [r0] ; yB v5 yA u5 y9 v4 y8 u4 y7 v3 y6 > u3 y5 v2 y4 u2 y3 v1 y2 u1 y1 v0 y0 u0 > %else > mova m0, [r0] > %endif At first I didn't understand why you do so much seemingly unnecessary work. You don't change how the data loaded into register. After more in-depth reading I see now that you shuffle data around just so you can store the data with a single move for each plane. The chroma is below. > +%if cpuflag(avx2) > + vpermd m1, m6, m1 ; 00 v5 v4 v3 00 v2 v1 v0 00 u5 u4 u3 > 00 u2 u1 u0 > + pshufb m1, m7 ; 00 00 v5 v4 v3 v2 v1 v0 00 00 u5 u4 > u3 u2 u1 u0 > + movu [r2+r4], xm1 > + vextracti128 [r3+r4], m1, 1 > +%else > movq [r2+r4], m1 > movhps [r3+r4], m1 > +%endif Sounds commendable but I doubt the use of this many more shuffles gets you much over a naive AVX2 version (where you treat the high half of ymm like an unroll). > +; for AVX2 version only > +v210_luma_permute: dd 0,1,2,4,5,6,7,7 > +v210_chroma_permute: dd 0,1,4,5,2,3,6,7 Are you sure these can't be replaced with vpermq and its immediate operand? It really looks like the second could be. It'll save you a register. > - mova m3, [v210_mult] > - mova m4, [v210_mask] > - mova m5, [v210_luma_shuf] > - mova m6, [v210_chroma_shuf] > + mova m3, [v210_luma_shuf] > + mova m4, [v210_chroma_shuf1] > + > +%if cpuflag(avx2) > + mova m5, [v210_luma_permute] ; VPERMD constant must be in a > register > + mova m6, [v210_chroma_permute] ; VPERMD constant must be in a > register > + mova m7, [v210_chroma_shuf2] > +%endif > + > +%if ARCH_X86_64 > + mova m8, [v210_mult] > + mova m9, [v210_mask] > +%endif > + It would let you clean this up a bit. My suggestion is to make the diff minimal by keeping the existing uses and if you still need more than 8 registers for avx2 then make it available for x86-64 only. Compare yours with the one I committed here https://github.com/Upipe/upipe/blob/master/lib/upipe-v210/v210dec.asm#L45 which is just FFmpeg's cleaned up a little plus avx2. I'm surprised it's not already in FFmpeg. You should do whatever is faster.
signature.asc
Description: OpenPGP digital signature
_______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel