aarch64: Add neon implementation for vsad8_intra

Martin Storsjö Fri, 16 Sep 2022 14:15:35 -0700

On Tue, 13 Sep 2022, Hubert Mazur wrote:

Provide optimized implementation for pix_median_abs16 function.


You forgot to update this part of the commit message.

Performance comparison tests are shown below.
- vsad_5_c: 94.7
- vsad_5_neon: 20.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <h...@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c |  3 ++
libavcodec/aarch64/me_cmp_neon.S         | 42 ++++++++++++++++++++++++
2 files changed, 45 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index fb51a833be..d3fa047a86 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -45,6 +45,8 @@ int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                ptrdiff_t stride, int h);
int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
                      ptrdiff_t stride, int h) ;
+int vsad_intra8_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
+                     ptrdiff_t stride, int h) ;
int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                ptrdiff_t stride, int h);
int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
@@ -75,6 +77,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)

        c->vsad[0] = vsad16_neon;
        c->vsad[4] = vsad_intra16_neon;
+        c->vsad[5] = vsad_intra8_neon;

        c->vsse[0] = vsse16_neon;
        c->vsse[4] = vsse_intra16_neon;
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index a4a4344f42..73701bd353 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -1050,3 +1050,45 @@ function pix_median_abs16_neon, export=1
        ret

endfunc
+
+function vsad_intra8_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *dummy
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        ld1             {v0.8b}, [x1], x3
+        sub             w4, w4, #1 // we need to make h-1 iterations
+        cmp             w4, #3
+        movi            v16.8h, #0
+        b.lt            2f
+
+1:
+        // v = abs( pix1[0] - pix1[0 + stride] )
+        // score = sum(v)
+        ld1             {v1.8b}, [x1], x3
+        ld1             {v2.8b}, [x1], x3
+        uabal           v16.8h, v0.8b, v1.8b
+        ld1             {v3.8b}, [x1], x3
+        sub             w4, w4, #3

Instinctively, I'd prefer to move the sub instruction up to between the first two ld1 instructions here. However, I don't see any change in benchmarks on Cortex A53 due to that, so it's not strictly necessary, but I'd prefer it that way.


Other than that, this looks very reasonable and straightforward.
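
For readers following the quoted loop, here is a minimal scalar sketch of what the comments in the assembly describe (v = abs(pix1[0] - pix1[0 + stride]), score = sum(v)) for an 8-pixel-wide block. The name vsad_intra8_ref is hypothetical and the code is not copied from FFmpeg; the unused context and dummy arguments of the real prototype are left out on purpose.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference (assumption, not the FFmpeg source):
 * sum of absolute differences between each row and the row below it,
 * over a block that is 8 pixels wide and h rows tall. */
static int vsad_intra8_ref(const uint8_t *s, ptrdiff_t stride, int h)
{
    int score = 0;

    /* h rows give h - 1 vertical row pairs, matching the
     * "we need to make h-1 iterations" comment in the assembly. */
    for (int y = 1; y < h; y++) {
        for (int x = 0; x < 8; x++)
            score += abs(s[x] - s[x + stride]);
        s += stride;
    }
    return score;
}
```

Each uabal in the quoted loop corresponds to one pass of the inner loop here: it accumulates the eight absolute differences of one row pair into 16-bit lanes.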

// Martin

