On Mon, 14 Nov 2016, Janne Grunau wrote:

On 2016-11-14 11:59:39 +0200, Martin Storsjö wrote:
On Mon, 14 Nov 2016, Janne Grunau wrote:

Since aarch64 has enough free general purpose registers, use them to
branch to the appropriate storage code. 1-2 cycles faster for the
functions using loop_filter 8/16, ... on a cortex-a53. Mixed results
(up to 2 cycles faster/slower) on a cortex-a57.
---
libavcodec/aarch64/vp9lpf_neon.S | 48 +++++++++++++++-------------------------
1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index 995a97d..3a82bd4 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -410,15 +410,19 @@
.endif
       // If no pixels needed flat8in nor flat8out, jump to a
       // writeout of the inner 4 pixels
-        cbz             x5,  7f
+        cbnz            x5,  1f
+        br              x14
+1:
       mov             x5,  v7.d[0]
.ifc \sz, .16b
       mov             x6,  v2.d[1]
       orr             x5,  x5,  x6
.endif
       // If no pixels need flat8out, jump to a writeout of the inner 6 pixels
-        cbz             x5,  8f
+        cbnz            x5,  1f
+        br              x15

+1:
       // flat8out
       // This writes all outputs into v2-v17 (skipping v6 and v16).
       // If this part is skipped, the output is read from v21-v26 (which is the input
@@ -549,35 +553,24 @@ endfunc

function vp9_loop_filter_8
       loop_filter     8,  .8b,  0,    v16, v17, v18, v19, v28, v29, v30, v31
-        mov             x5,  #0
       ret
6:
-        mov             x5,  #6
-        ret
+        br              x13
9:
       br              x10
endfunc

Looks really neat, thanks!
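
For anyone reading along: the callers now seed a few free GPRs with the
addresses of their own writeout paths, so the shared filter can dispatch
with a plain br instead of returning a code in x5 for the caller to test.
Roughly like this (a sketch only; the register numbers, labels and return
plumbing are illustrative, not the exact code in vp9lpf_neon.S):

function hypothetical_h_8_8
        mov             x10,  x30       // keep the real return address
        adr             x14,  4f        // writeout of the inner 4 pixels
        adr             x15,  6f        // writeout of the inner 6 pixels
        bl              vp9_loop_filter_8
        // fell through: full writeout of all filtered pixels
        br              x10
4:      // reached via "br x14" from the filter: store only the inner 4 pixels
        br              x10
6:      // reached via "br x15" from the filter: store only the inner 6 pixels
        br              x10
endfunc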

Couldn't you get rid of the 6: label here as well, with something like this?

@@ -352,7 +352,13 @@
 .endif
         // If no pixels need flat8in, jump to flat8out
         // (or to a writeout of the inner 4 pixels, for wd=8)
+.if \wd == 16
         cbz             x5,  6f
+.else
+        cbnz            x5,  6f
+        br              x13
+6:
+.endif

I don't think this will have a measurable effect. If anything it could
make branch prediction for the full loop filter worse (static branch
prediction assumes a conditional branch is not taken). It also makes the
already complicated loop filter macro a little more complicated, just to
remove mostly clear code after the macro instantiation. So I think we
shouldn't do it.
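
I.e. with the current code the common case (some pixels need flat8in)
falls through, while the suggested variant turns it into a taken
conditional branch plus an extra label. Side by side (illustrative only):

        // current: the common case falls through, matching the static
        // "not taken" assumption for the conditional branch
        cbz             x5,  6f         // rare: nothing needs flat8in

        // suggested: the common case has to take the conditional branch
        cbnz            x5,  6f         // common: something needs flat8in
        br              x13             // rare: go straight to the writeout
6:
        // flat8in follows here in both variants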

Right, yes, and at most it would remove one branch from the return path of the already fast-ish early exits.

Patch ok then.

// Martin