output.S: refactor ff_yuv2plane1_8_neon

Martin Storsjö Fri, 07 Feb 2025 02:10:10 -0800

On Fri, 31 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

The benchmarks (before vs after) were gathered using
./tests/checkasm/checkasm --test=sw_scale --bench --runs=6 | grep yuv2yuv1

Thanks for the numbers! To keep it more readable and concise, we'd onlyneed to see the numbers for the _neon versions here - and we probablydon't need to see all the variants/combinations of it, just one or acouple would be fine.


Anyway, the numbers look great, thanks!

yuv2yuv1_16_512_accurate_c:                           1439.9 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                          77.6 (18.55x)
yuv2yuv1_16_512_approximate_c:                        1422.1 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                       78.1 (18.20x)
yuv2yuv1_19_512_accurate_c:                           1447.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                          78.1 (18.52x)
yuv2yuv1_19_512_approximate_c:                        1474.4 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                       78.1 (18.87x)

Krzysztof

---
libswscale/aarch64/output.S | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

Your signature above does get included as part of the commit message,while it's probably not needed/intended :-) I'm editing it out beforemerging though.

(I also end up needing to apply your actual email address and name, as themailing list adds the "via ffmpeg-devel" to the name and rewrites the fromaddress. But that's obviously outside of your control.)


diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index 934d62dfd0..190c438870 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -214,21 +214,15 @@ function ff_yuv2plane1_8_neon, export=1
        and             w4, w4, #7
        cbz             w4, 1f                          // check if offsetting 
present
        ext             v0.8b, v0.8b, v0.8b, #3         // honor offsetting 
which can be 0 or 3 only
-1:      uxtl            v0.8h, v0.8b                    // extend dither to 
32-bit
-        uxtl            v1.4s, v0.4h
-        uxtl2           v2.4s, v0.8h
+1:
+        uxtl            v0.8h, v0.8b                    // extend dither to 
32-bit
2:
        ld1             {v3.8h}, [x0], #16              // read 8x16-bit @ 
src[j  ][i + {0..7}]: A,B,C,D,E,F,G,H
-        sxtl            v4.4s, v3.4h
-        sxtl2           v5.4s, v3.8h
-        add             v4.4s, v4.4s, v1.4s
-        add             v5.4s, v5.4s, v2.4s
-        sqshrun         v4.4h, v4.4s, #6
-        sqshrun2        v4.8h, v5.4s, #6
-
-        uqshrn          v3.8b, v4.8h, #1                // clip8(val>>7)
        subs            w2, w2, #8                      // dstW -= 8
-        st1             {v3.8b}, [x1], #8               // write to destination
+        shadd           v1.8h, v0.8h, v3.8h             // v1 = (v0 + v3) >> 1
+        sqshrun         v2.8b, v1.8h, #6                // clip_uint8(v1 >> 6)
+
+        st1             {v2.8b}, [x1], #8               // write to destination
        b.gt            2b                              // loop until width 
consumed
        ret
endfunc
--


Thanks, these changes look good - will push.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH] swscale/aarch64/output.S: refactor ff_yuv2plane1_8_neon

Reply via email to