Hello,

Following Henrik Gramner's comments (in the discussion "avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)"), attached are new patches that add an AVX2 version of each 8-bit func (except divide).

001 : avutil : add ABS2 for AVX2
002 : avfilter : add AVX2 version for most of the funcs

When the processing stays in 8 bits, the AVX2 version is a simple modification: VBROADCASTI128 for constant loading.

When the processing uses intermediate 16 bits, I add two macros.

For the load part, PMOVZXBW: loads mmsize/2 bytes and zero-extends them to 16-bit words (the SSE4 version seems to be slower most of the time than the SSE2 "emulation").

Since AVX2 doesn't need a zero-filled vector register, I add an if/else at the start of each blend macro and change the index of the vector registers:

    %macro GRAINEXTRACT 0
    %if cpuflag(avx2)
        BLEND_INIT grainextract, 3
    %else ; SSE2
        BLEND_INIT grainextract, 4
        pxor       m3, m3
    %endif

For the store part, I add the PACKUSWB_AND_STORE macro.

This simplifies the code of each blend macro.

Passes the FATE tests for me.

Checkasm result (x86_64, Kaby Lake):

    ./tests/checkasm/checkasm --test=vf_blend --bench
    benchmarking with native FFmpeg timers
    nop: 35.7
    checkasm: using random seed 3558581064
    SSE2:
     - vf_blend.8bit [OK]
    SSSE3:
     - vf_blend.8bit [OK]
    AVX2:
     - vf_blend.8bit [OK]
    checkasm: all 37 tests passed
    addition_c: 20523.3
    addition_sse2: 441.8
    addition_avx2: 383.3
    and_c: 14490.3
    and_sse2: 485.8
    and_avx2: 205.8
    average_c: 15600.5
    average_sse2: 1206.0
    average_avx2: 773.0
    darken_c: 27218.0
    darken_sse2: 397.3
    darken_avx2: 194.3
    difference_c: 20607.8
    difference_sse2: 980.8
    difference_ssse3: 968.0
    difference_avx2: 487.0
    extremity_c: 17286.0
    extremity_sse2: 1174.0
    extremity_ssse3: 981.8
    extremity_avx2: 550.0
    grainextract_c: 22145.3
    grainextract_sse2: 1158.5
    grainextract_avx2: 771.5
    grainmerge_c: 24505.5
    grainmerge_sse2: 1158.8
    grainmerge_avx2: 774.5
    hardmix_c: 16505.5
    hardmix_sse2: 490.8
    hardmix_avx2: 388.8
    lighten_c: 27153.0
    lighten_sse2: 485.0
    lighten_avx2: 251.3
    multiply_c: 16459.8
    multiply_sse2: 1382.5
    multiply_avx2: 844.0
    negation_c: 32143.8
    negation_sse2: 1369.0
    negation_ssse3: 1175.3
    negation_avx2: 522.5
    or_c: 13359.5
    or_sse2: 397.3
    or_avx2: 195.8
    phoenix_c: 31159.8
    phoenix_sse2: 551.0
    phoenix_avx2: 310.5
    screen_c: 25372.3
    screen_sse2: 1804.0
    screen_avx2: 1069.0
    subtract_c: 16782.5
    subtract_sse2: 478.8
    subtract_avx2: 236.5
    xor_c: 15374.8
    xor_sse2: 491.3
    xor_avx2: 237.0

Martin
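
PS: for anyone not familiar with the 16-bit path, here is a rough scalar C model of the load / compute / store sequence described above. It is only an illustration, not code from the patch: the function names (widen_u8, pack_saturate_u8, grainextract_model) are hypothetical, and the grainextract formula clip(128 + top - bottom) is assumed to match the C reference.

```c
#include <stddef.h>
#include <stdint.h>

/* Zero-extend one byte to a 16-bit lane, as PMOVZXBW does per element. */
static int16_t widen_u8(uint8_t v)
{
    return (int16_t)v;
}

/* Saturate a signed 16-bit value to 0..255, as PACKUSWB does per element. */
static uint8_t pack_saturate_u8(int16_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Hypothetical scalar model of grainextract (assumed formula:
 * clip(128 + top - bottom)). The intermediate value ranges from
 * -127 to 383, which is why the SIMD code needs 16-bit lanes. */
static void grainextract_model(const uint8_t *top, const uint8_t *bottom,
                               uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int16_t a = widen_u8(top[i]);
        int16_t b = widen_u8(bottom[i]);
        int16_t r = (int16_t)(a + 128 - b);
        dst[i] = pack_saturate_u8(r);
    }
}
```

The SIMD versions do the same thing on mmsize/2 pixels per iteration. The only difference between the SSE2 and AVX2 paths in this step is how the widening is done (unpack against a zeroed register vs. a direct zero-extending load), which is why the AVX2 path needs one vector register less.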
0001-avutil-x86-x86util-add-ABS2-for-AVX2.patch
0002-avfilter-x86-vf_blend-add-AVX2-version-for-each-func.patch
_______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel