On Wed, 2 Nov 2016, Martin Storsjö wrote:

This work is sponsored by, and copyright, Google.

These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.

Examples of runtimes (in cycles) vs the 32 bit version, on a Cortex A53:
                                     ARM   AArch64
vp9_avg4_neon:                      32.2      23.7
vp9_avg8_neon:                      57.5      53.7
vp9_avg16_neon:                    168.6     165.4
vp9_avg32_neon:                    586.7     585.2
vp9_avg64_neon:                   2458.6    2325.9
vp9_avg_8tap_smooth_4h_neon:       130.7     124.0
vp9_avg_8tap_smooth_4hv_neon:      478.8     440.3
vp9_avg_8tap_smooth_4v_neon:       118.0      96.2
vp9_avg_8tap_smooth_8h_neon:       239.7     232.0
vp9_avg_8tap_smooth_8hv_neon:      691.3     649.9
vp9_avg_8tap_smooth_8v_neon:       238.0     214.5
vp9_avg_8tap_smooth_64h_neon:    11512.9   11492.8
vp9_avg_8tap_smooth_64hv_neon:   23322.1   23255.1
vp9_avg_8tap_smooth_64v_neon:    11556.2   11554.5
vp9_put4_neon:                      18.0      16.5
vp9_put8_neon:                      40.2      37.7
vp9_put16_neon:                     99.4      95.2
vp9_put32_neon:                    348.8     307.4
vp9_put64_neon:                   1321.3    1109.8
vp9_put_8tap_smooth_4h_neon:       124.7     117.3
vp9_put_8tap_smooth_4hv_neon:      465.8     425.3
vp9_put_8tap_smooth_4v_neon:       105.0      82.5
vp9_put_8tap_smooth_8h_neon:       227.7     218.2
vp9_put_8tap_smooth_8hv_neon:      661.4     620.1
vp9_put_8tap_smooth_8v_neon:       208.0     187.2
vp9_put_8tap_smooth_64h_neon:    10864.6   10873.9
vp9_put_8tap_smooth_64hv_neon:   21359.4   21295.7
vp9_put_8tap_smooth_64v_neon:     9629.1    9639.4

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.

The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for the larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.
---
v2: Updated according to the comments on the 32 bit version.
---
libavcodec/aarch64/Makefile              |   2 +
libavcodec/aarch64/vp9dsp_init_aarch64.c | 139 ++++++
libavcodec/aarch64/vp9mc_neon.S          | 733 +++++++++++++++++++++++++++++++
libavcodec/vp9.h                         |   1 +
libavcodec/vp9dsp.c                      |   2 +
5 files changed, 877 insertions(+)
create mode 100644 libavcodec/aarch64/vp9dsp_init_aarch64.c
create mode 100644 libavcodec/aarch64/vp9mc_neon.S

+function ff_vp9_copy64_neon, export=1
+1:
+        ldp             x5,  x6,  [x2]
+        stp             x5,  x6,  [x0]
+        ldp             x5,  x6,  [x2, #16]
+        stp             x5,  x6,  [x0, #16]
+        subs            w4,  w4,  #1
+        ldp             x5,  x6,  [x2, #32]
+        stp             x5,  x6,  [x0, #32]
+        ldp             x5,  x6,  [x2, #48]
+        stp             x5,  x6,  [x0, #48]
+        add             x2,  x2,  x3
+        add             x0,  x0,  x1
+        b.ne            1b
+        ret
+endfunc

I forgot to mention it anywhere, but the copy32 and copy64 functions don't actually use any vector registers at all, only plain aarch64 ldp/stp. When implemented with neon loads/stores, they ended up significantly slower than the C version on my dragonboard.

Currently copy64 runs at around 1100 cycles, while a trivial neon version (that loads all 64 bytes at once with a ld1 {v0,v1,v2,v3}) runs at around 1600 cycles. One could of course play with different combinations of loading 16, 32 or 64 bytes per ld1 and scheduling them differently (IIRC I did try at least some of those combinations), but I never got down to what the C version did unless I used ldp/stp.
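
For readers who don't read AArch64 assembly, here is a rough C model of what the ldp/stp loop in the hunk above does. It is only a sketch: the name copy64_model is made up for illustration, the argument order (dst, dst stride, src, src stride, height) is inferred from how x0-x4 are used in the assembly, and it says nothing about how the real C fallback is written or how fast it is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the ff_vp9_copy64_neon loop: every row is moved
 * as four 16-byte chunks, one per ldp/stp pair in the assembly. */
static void copy64_model(uint8_t *dst, ptrdiff_t dst_stride,
                         const uint8_t *src, ptrdiff_t src_stride, int h)
{
    /* The assembly decrements the counter before testing it,
     * so it assumes at least one row. */
    assert(h > 0);
    do {
        for (int off = 0; off < 64; off += 16) {
            uint64_t pair[2];               /* stands in for x5, x6 */
            memcpy(pair, src + off, 16);    /* ldp x5, x6, [x2, #off] */
            memcpy(dst + off, pair, 16);    /* stp x5, x6, [x0, #off] */
        }
        src += src_stride;                  /* add x2, x2, x3 */
        dst += dst_stride;                  /* add x0, x0, x1 */
    } while (--h);                          /* subs w4, w4, #1 ; b.ne 1b */
}
```

The point of the model is just that the whole function is a strided 64-byte-per-row memcpy through general purpose registers; nothing in it needs the vector unit, which is why the _neon name is technically inaccurate.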

Technically, having a _neon suffix for them is wrong, but anything else (omitting these two while hooking up avg32/avg64 separately) is more complication - although I'm open to suggestions on how to handle it best.

// Martin
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
