Hi,

On Tue, May 14, 2024 at 4:40 PM Stone Chen <chen.stonec...@gmail.com> wrote:

> +    vvc_sad_8:
> +        .loop_height:
> +        movu              xm0, [src1q]
> +        movu              xm1, [src2q]
> +        MIN_MAX_SAD       xm2, xm0, xm1
> +        vpmovzxwd          m1, xm1
> +        vpaddd             m3, m1
>
[..]

> +    vvc_sad_16_128:
> +        .loop_height:
>
[..]

> +        .loop_width:
> +            movu              xm0, [src1q]
> +            movu              xm1, [src2q]
> +            MIN_MAX_SAD       xm2, xm0, xm1
> +            vpmovzxwd          m1, xm1
> +            vpaddd             m3, m1
>

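Wouldn't it be more efficient if the main loops processed a full register's
worth of data per iteration? Something like this, where pmaddwd against words
of 1 does the word-to-dword widening that vpmovzxwd does in the patch: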

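vpbroadcastd m4, [pw_1]     ; every word lane = 1
loop:
movu m0, [src1q]            ; full-width load instead of xm0
movu m1, [src2q]
MIN_MAX_SAD m2, m0, m1      ; absolute differences end up in m1, as in the patch
pmaddwd m1, m4              ; sum adjacent word pairs into dwords
paddd m3, m1                ; accumulate into m3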
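
To illustrate why the pmaddwd step can stand in for vpmovzxwd, here is a rough scalar C model of both accumulation schemes. The function names are made up for this sketch and are not part of the patch. It also assumes every absolute difference stays below 32768, because pmaddwd treats its word inputs as signed.

```c
#include <stdint.h>

/* Absolute difference of two unsigned 16-bit samples. This models what
 * MIN_MAX_SAD appears to leave in its last operand (min/max, then subtract). */
static uint16_t abs_diff_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a > b ? a - b : b - a);
}

/* Model of the patch: zero-extend each word to a dword, then add. */
static uint32_t sad_widen_add(const uint16_t *src1, const uint16_t *src2, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += abs_diff_u16(src1[i], src2[i]);
    return acc;
}

/* Model of pmaddwd against a vector of ones: each adjacent pair of signed
 * words is multiplied by 1 and the two products are summed into one dword.
 * n is assumed to be even. */
static uint32_t sad_madd_ones(const uint16_t *src1, const uint16_t *src2, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2) {
        /* Signed reinterpretation: only safe while the value is < 32768. */
        int16_t d0 = (int16_t)abs_diff_u16(src1[i], src2[i]);
        int16_t d1 = (int16_t)abs_diff_u16(src1[i + 1], src2[i + 1]);
        acc += d0 * 1 + d1 * 1;
    }
    return (uint32_t)acc;
}
```

Both functions return the same total under that assumption. The difference in the SIMD version is that the pmaddwd form consumes a full register of words per instruction, while the widening form only consumes half a register.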

(And then for w8, load 2 rows per iteration using movu xmN, [row0] and
vinserti128 mN, [row1], 1.)
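
As a rough scalar C sketch of the two-rows-per-iteration idea for w8: row 0 fills the low half of a 16-word buffer and row 1 the high half, which is what the movu xmN / vinserti128 pair would do with a 256-bit register. The function name, the strides being in elements rather than bytes, and the even height are all assumptions of this sketch, not something taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Absolute difference of two unsigned 16-bit samples. */
static uint32_t absdiff16(uint16_t a, uint16_t b)
{
    return (uint32_t)(a > b ? a - b : b - a);
}

/* SAD over an 8-wide block, handling two rows per loop iteration.
 * stride1/stride2 are in uint16_t elements; height is assumed even. */
static uint32_t sad_w8_two_rows(const uint16_t *src1, ptrdiff_t stride1,
                                const uint16_t *src2, ptrdiff_t stride2,
                                int height)
{
    uint32_t acc = 0;
    for (int y = 0; y < height; y += 2) {
        uint16_t r1[16], r2[16];

        /* Low half = row 0 (the movu xmN load). */
        memcpy(r1, src1, 8 * sizeof(uint16_t));
        memcpy(r2, src2, 8 * sizeof(uint16_t));
        /* High half = row 1 (the vinserti128 ..., 1 step). */
        memcpy(r1 + 8, src1 + stride1, 8 * sizeof(uint16_t));
        memcpy(r2 + 8, src2 + stride2, 8 * sizeof(uint16_t));

        /* One full-register pass over all 16 words. */
        for (int i = 0; i < 16; i++)
            acc += absdiff16(r1[i], r2[i]);

        src1 += 2 * stride1;
        src2 += 2 * stride2;
    }
    return acc;
}
```

With this layout the w8 case can share the same full-width inner body as the wider blocks, and the row loop runs half as many iterations.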

Ronald
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
