On Sat, May 18, 2024 at 11:33 AM Ronald S. Bultje <rsbul...@gmail.com> wrote:
> Hi,
>
> On Tue, May 14, 2024 at 4:40 PM Stone Chen <chen.stonec...@gmail.com>
> wrote:
>
>> + vvc_sad_8:
>> +     .loop_height:
>> +         movu              xm0, [src1q]
>> +         movu              xm1, [src2q]
>> +         MIN_MAX_SAD       xm2, xm0, xm1
>> +         vpmovzxwd          m1, xm1
>> +         vpaddd             m3, m1
>>
> [..]
>
>> + vvc_sad_16_128:
>> +     .loop_height:
>>
> [..]
>
>> +     .loop_width:
>> +         movu              xm0, [src1q]
>> +         movu              xm1, [src2q]
>> +         MIN_MAX_SAD       xm2, xm0, xm1
>> +         vpmovzxwd          m1, xm1
>> +         vpaddd             m3, m1
>>
>
> Wouldn't it be more efficient if the main loops did a full register worth
> at a time?
>
>     vpbroadcastd m4, [pw_1]
> loop:
>     movu          m0, [src1q]
>     movu          m1, [src2q]
>     MIN_MAX_SAD   m2, m0, m1
>     pmaddwd       m1, m4
>     paddd         m3, m1
>
> (And then for w8, load 2 rows per iteration using movu xmN, [row0] and
> vinserti128 mN, [row1], 1.)
>
> Ronald
>

Hi Ronald,

Thank you, I didn't know about the pmaddwd instruction, using it is definitely more efficient!

Stone
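
For anyone following along, here is a rough scalar C sketch of why pmaddwd against a vector of ones (pw_1) can stand in for the vpmovzxwd + vpaddd pair. It is only a model under stated assumptions, not the patch's code: the names pmaddwd_model and sad_row16_model are hypothetical, the samples are assumed to be 16-bit, and the absolute differences are assumed to fit in a signed 16-bit word (pmaddwd treats its inputs as signed).

```c
#include <stdint.h>

#define LANES 16 /* number of 16-bit words in one 256-bit ymm register */

/* Scalar model of pmaddwd: multiply signed 16-bit lanes element-wise and
 * add each adjacent pair of products into one signed 32-bit lane. */
static void pmaddwd_model(int32_t dst[LANES / 2],
                          const int16_t a[LANES],
                          const int16_t b[LANES])
{
    for (int i = 0; i < LANES / 2; i++)
        dst[i] = (int32_t)a[2 * i] * b[2 * i] +
                 (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Hypothetical helper: SAD over one full register worth of samples.
 * Multiplying by all-ones makes pmaddwd a widening pairwise add, so the
 * 16 word differences become 8 dword partial sums in a single step. */
static int32_t sad_row16_model(const int16_t *src1, const int16_t *src2)
{
    int16_t diff[LANES], ones[LANES];
    int32_t pairs[LANES / 2];
    int32_t sum = 0;

    for (int i = 0; i < LANES; i++) {
        int d = src1[i] - src2[i];
        diff[i] = (int16_t)(d < 0 ? -d : d); /* assumed to fit in int16 */
        ones[i] = 1;                         /* models the pw_1 constant */
    }

    pmaddwd_model(pairs, diff, ones);        /* pmaddwd m1, m4 */

    for (int i = 0; i < LANES / 2; i++)      /* paddd m3, m1 + final reduce */
        sum += pairs[i];
    return sum;
}
```

The point of the model: the zero-extend approach widens only 8 words per instruction, while the multiply-by-one form consumes all 16 words of the register and already halves the lane count, so the accumulator needs one paddd per full register load.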