On Tue, Jan 16, 2018 at 11:33 PM, Martin Vignali <martin.vign...@gmail.com> wrote:
> BLEND_INIT grainextract, 4

You could also try doing twice as much per iteration, which might be more efficient, especially in avx2 since it avoids cross-lane shuffles. Applies to some other ones as well. E.g. something like:

    pxor            m4, m4
    VBROADCASTI128  m5, [pw_128]
.loop:
    movu            m1, [topq + xq]
    movu            m3, [bottomq + xq]
    punpcklbw       m0, m1, m4
    punpckhbw       m1, m4
    punpcklbw       m2, m3, m4
    punpckhbw       m3, m4
    paddw           m0, m5
    paddw           m1, m5
    psubw           m0, m2
    psubw           m1, m3
    packuswb        m0, m1
    mova            [dstq + xq], m0
    add             xq, mmsize
    jl .loop

> BLEND_INIT average, 3

pavgb should probably be more efficient than unpacking to words. It does round up, so some bitflipping shenanigans might be required if you want to round down. E.g. something like:

    pcmpeqb         m2, m2
.loop:
    movu            m0, [topq + xq]
    movu            m1, [bottomq + xq]
    pxor            m0, m2
    pxor            m1, m2
    pavgb           m0, m1
    pxor            m0, m2
    mova            [dstq + xq], m0
    add             xq, mmsize
    jl .loop

(optionally combining movu+pxor into a 3-arg pxor with avx, since memory operands can be unaligned in VEX-encoded instructions).

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
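
For anyone wanting to sanity-check the two suggestions above without assembling anything, here is a rough scalar C model of the per-byte arithmetic. It is only a sketch: the function names are made up for illustration and are not part of FFmpeg, and it assumes the intended "average" is (a + b) >> 1 (round down) and that grainextract is top - bottom + 128 with unsigned saturation, which is what the loops above compute.

```c
#include <stdint.h>

/* Scalar model of pavgb on one byte: unsigned average, rounding up. */
static uint8_t avg_round_up(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Round-down average via the complement trick: ~pavgb(~a, ~b).
 * Hypothetical helper name, mirrors the pxor/pavgb/pxor sequence. */
static uint8_t avg_round_down(uint8_t a, uint8_t b)
{
    return (uint8_t)~avg_round_up((uint8_t)~a, (uint8_t)~b);
}

/* Scalar model of packuswb on one lane: signed word to unsigned byte,
 * saturating to the 0..255 range. */
static uint8_t pack_us(int16_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* grainextract as the word-unpack loop computes it:
 * (top + 128) - bottom in 16 bits, then saturated back to a byte. */
static uint8_t grainextract(uint8_t top, uint8_t bottom)
{
    return pack_us((int16_t)(top + 128 - bottom));
}

/* Exhaustively compare the complement trick against (a + b) >> 1.
 * Returns the number of input pairs where the two disagree. */
static int count_avg_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (avg_round_down((uint8_t)a, (uint8_t)b) != ((a + b) >> 1))
                bad++;
    return bad;
}
```

The exhaustive loop covers all 65536 byte pairs, so it is a cheap way to convince yourself that inverting both inputs and the output turns the round-up of pavgb into a round-down.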