On 2016-11-14 11:59:39 +0200, Martin Storsjö wrote:
> On Mon, 14 Nov 2016, Janne Grunau wrote:
> 
> >Since aarch64 has enough free general purpose registers use them to
> >branch to the appropriate storage code. 1-2 cycles faster for the
> >functions using loop_filter 8/16, ... on a cortex-a53. Mixed results
> >(up to 2 cycles faster/slower) on a cortex-a57.
> >---
> >libavcodec/aarch64/vp9lpf_neon.S | 48 +++++++++++++++-------------------------
> >1 file changed, 18 insertions(+), 30 deletions(-)
> >
> >diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
> >index 995a97d..3a82bd4 100644
> >--- a/libavcodec/aarch64/vp9lpf_neon.S
> >+++ b/libavcodec/aarch64/vp9lpf_neon.S
> >@@ -410,15 +410,19 @@
> >.endif
> >        // If no pixels needed flat8in nor flat8out, jump to a
> >        // writeout of the inner 4 pixels
> >-        cbz             x5,  7f
> >+        cbnz            x5,  1f
> >+        br              x14
> >+1:
> >        mov             x5,  v7.d[0]
> >.ifc \sz, .16b
> >        mov             x6,  v2.d[1]
> >        orr             x5,  x5,  x6
> >.endif
> >        // If no pixels need flat8out, jump to a writeout of the inner 6 pixels
> >-        cbz             x5,  8f
> >+        cbnz            x5,  1f
> >+        br              x15
> >
> >+1:
> >        // flat8out
> >        // This writes all outputs into v2-v17 (skipping v6 and v16).
> >        // If this part is skipped, the output is read from v21-v26 (which is the input
> >@@ -549,35 +553,24 @@ endfunc
> >
> >function vp9_loop_filter_8
> >        loop_filter     8,  .8b,  0,    v16, v17, v18, v19, v28, v29, v30, v31
> >-        mov             x5,  #0
> >        ret
> >6:
> >-        mov             x5,  #6
> >-        ret
> >+        br              x13
> >9:
> >        br              x10
> >endfunc
> 
> Looks really neat, thanks!
> 
> Couldn't you get rid of the 6: label here as well, with something like this?
> 
> @@ -352,7 +352,13 @@
>  .endif
>          // If no pixels need flat8in, jump to flat8out
>          // (or to a writeout of the inner 4 pixels, for wd=8)
> +.if \wd == 16
>          cbz             x5,  6f
> +.else
> +        cbnz            x5,  6f
> +        br              x13
> +6:
> +.endif

I don't think this will have a measurable effect. If anything, it could 
make branch prediction for the full loop filter worse (static branch 
prediction assumes a conditional branch is not taken). It also makes the 
already complicated loop filter macro a little more complicated, only to 
remove mostly clear code after the macro instantiation. So I think we 
shouldn't do it.
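
A minimal sketch of the two branch shapes being compared. The names
variant_a and variant_b are hypothetical, x0 stands in for the flat8in
mask, and the add is only a placeholder for the filter body; none of
this is taken from vp9lpf_neon.S. It assumes x13 already holds the
address of the caller's writeout code.

```asm
        .text

        // Variant A: shape kept by the patch. The conditional branch is
        // not taken on the full-filter path, and 6: stays after the body.
variant_a:
        cbz             x0,  6f         // taken only when no pixel needs flat8in
        add             x1,  x1,  #1    // placeholder for the filter body
        ret
6:
        br              x13             // alternative "return": caller's writeout

        // Variant B: shape from the suggestion. No trailing 6: label, but
        // the conditional branch must be taken to reach the filter body.
variant_b:
        cbnz            x0,  6f         // taken on the full-filter path
        br              x13             // alternative "return": caller's writeout
6:
        add             x1,  x1,  #1    // placeholder for the filter body
        ret
```

In variant B the full-filter path depends on a taken forward branch,
which is the case the static-prediction remark above is concerned with.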

> And similarly for the 9: label for all cases except \wd == 16 (where 
> we need it for the clobbered registers).

The same applies here. I tried a different approach for the \wd == 16 
case: mov x12, x30, then using x10 to return to the stack 
clean-up. That ended up 1 cycle slower though, because of the extra 
adr x10, $stack_clean_label.
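
One way that alternative could look, as a rough sketch only: the
experiment itself is not shown in this thread. The names
hypothetical_wrapper_16 and hypothetical_filter_16 are made up, and a
single stp/ldp pair stands in for the real save and restore of the
clobbered registers.

```asm
        .text

hypothetical_wrapper_16:
        mov             x12, x30                // keep the real return address
        stp             d8,  d9,  [sp, #-16]!   // placeholder for the register save
        adr             x10, 9f                 // the extra instruction: early exits go to 9:
        bl              hypothetical_filter_16
9:
        // stack clean-up, reached by the normal ret and by br x10
        ldp             d8,  d9,  [sp], #16
        ret             x12

hypothetical_filter_16:
        cbnz            x5,  1f
        br              x10                     // early exit straight to the clean-up
1:
        ret                                     // full path returns via x30, also to 9:
```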

Janne