aarch64: vp9lpf: Keep the comparison to E within 8 bit

Janne Grunau Thu, 23 Feb 2017 12:27:06 -0800

On 2017-02-11 22:19:02 +0200, Martin Storsjö wrote:
> On Fri, 10 Feb 2017, Janne Grunau wrote:
> 
> >On 2017-01-15 22:55:48 +0200, Martin Storsjö wrote:
> >>The theoretical maximum value of E is 193, so we can just
> >>saturate the addition to 255.
> >>
> >>Before:                     Cortex A7      A8      A9     A53  A53/AArch64
> >>vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
> >>vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
> >>vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
> >>vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
> >>After:
> >>vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
> >>vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
> >>vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
> >>vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7
> >>---
> >> libavcodec/aarch64/vp9lpf_neon.S | 40 +++++++++-------------------------------
> >> libavcodec/arm/vp9lpf_neon.S     | 11 +++++------
> >> 2 files changed, 14 insertions(+), 37 deletions(-)
> >>
> >>diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
> >>index 3b8e6eb..4553173 100644
> >>--- a/libavcodec/aarch64/vp9lpf_neon.S
> >>+++ b/libavcodec/aarch64/vp9lpf_neon.S
> >>@@ -51,13 +51,6 @@
> >> // see the arm version instead.
> >>
> >>
> >>-.macro uabdl_sz dst1, dst2, in1, in2, sz
> >>-        uabdl           \dst1,  \in1\().8b,  \in2\().8b
> >>-.ifc \sz, .16b
> >>-        uabdl2          \dst2,  \in1\().16b, \in2\().16b
> >>-.endif
> >>-.endm
> >>-
> >> .macro add_sz dst1, dst2, in1, in2, in3, in4, sz
> >>         add             \dst1,  \in1,  \in3
> >> .ifc \sz, .16b
> >>@@ -86,20 +79,6 @@
> >> .endif
> >> .endm
> >>
> >>-.macro cmhs_sz dst1, dst2, in1, in2, in3, in4, sz
> >>-        cmhs            \dst1,  \in1,  \in3
> >>-.ifc \sz, .16b
> >>-        cmhs            \dst2,  \in2,  \in4
> >>-.endif
> >>-.endm
> >>-
> >>-.macro xtn_sz dst, in1, in2, sz
> >>-        xtn             \dst\().8b,  \in1
> >>-.ifc \sz, .16b
> >>-        xtn2            \dst\().16b, \in2
> >>-.endif
> >>-.endm
> >>-
> >> .macro usubl_sz dst1, dst2, in1, in2, sz
> >>         usubl           \dst1,  \in1\().8b,  \in2\().8b
> >> .ifc \sz, .16b
> >>@@ -179,20 +158,20 @@
> >> // tmpq2 == tmp3 + tmp4, etc.
> >> .macro loop_filter wd, sz, mix, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8
> >> .if \mix == 0
> >>-        dup             v0.8h,  w2        // E
> >>-        dup             v1.8h,  w2        // E
> >>+        dup             v0\sz,  w2        // E
> >>         dup             v2\sz,  w3        // I
> >>         dup             v3\sz,  w4        // H
> >> .else
> >>-        dup             v0.8h,  w2        // E
> >>+        dup             v0.8b,  w2        // E
> >>         dup             v2.8b,  w3        // I
> >>         dup             v3.8b,  w4        // H
> >>+        lsr             w5,     w2,  #8
> >>         lsr             w6,     w3,  #8
> >>         lsr             w7,     w4,  #8
> >>-        ushr            v1.8h,  v0.8h, #8 // E
> >>+        dup             v1.8b,  w5        // E
> >>         dup             v4.8b,  w6        // I
> >>-        bic             v0.8h,  #255, lsl 8 // E
> >>         dup             v5.8b,  w7        // H
> >>+        trn1            v0.2d,  v0.2d,  v1.2d
> >
> >isn't this equivalent to
> >
> >dup  v0.8h, w2
> >uzp1 v0.16b, v0.16b, v0.16b
> >
> >on little endian?
> 
> Nice idea, but it isn't quite as straightforward on aarch64 - on arm it
> would have been.


gah, yes.

> All the even values will be output in the output registers of uzp1, so 
> you need uzp2 as well.
> 
> So instead of this as we have now:
> 
>     dup  v0.8b, w2
>     lsr  w5, w2, #8
>     dup  v1.8b, w5
>     trn1 v0.2d, v0.2d, v1.2d
> 
> We could do:
> 
>     dup  v0.8h, w2
>     uzp2 v1.16b, v0.16b, v0.16b
>     uzp1 v0.16b, v0.16b, v0.16b
>     trn1 v0.2d, v0.2d, v1.2d

rev16 v1.16b, v0.16b // or ext ..x or any other instruction
uzp1  v0.16b, v0.16b, v1.16b

is one instruction fewer, but also not straightforward
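
To make the byte shuffles above easier to check, here is a minimal C sketch that models the instructions on a 16-byte array. It assumes little endian and that w2 holds E0 in bits 0-7 and E1 in bits 8-15. The vec128 type and all function names are made up for illustration; none of this is code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit NEON register as 16 bytes, byte 0 lowest. */
typedef struct { uint8_t b[16]; } vec128;

/* dup Vd.8h, Wn: low 16 bits of w copied into every halfword. */
static vec128 dup_8h(uint32_t w)
{
    vec128 r;
    for (int i = 0; i < 8; i++) {
        r.b[2 * i]     = (uint8_t)(w & 0xff);
        r.b[2 * i + 1] = (uint8_t)((w >> 8) & 0xff);
    }
    return r;
}

/* dup Vd.8b, Wn: low byte of w in the low 8 bytes, upper 64 bits zeroed. */
static vec128 dup_8b(uint32_t w)
{
    vec128 r;
    memset(r.b, 0, 16);
    memset(r.b, (int)(w & 0xff), 8);
    return r;
}

/* uzp1 (odd = 0) or uzp2 (odd = 1) on .16b: even or odd bytes of a, then of b. */
static vec128 uzp_16b(vec128 a, vec128 b, int odd)
{
    vec128 r;
    for (int i = 0; i < 8; i++) {
        r.b[i]     = a.b[2 * i + odd];
        r.b[8 + i] = b.b[2 * i + odd];
    }
    return r;
}

/* rev16 Vd.16b, Vn.16b: swap the two bytes of every halfword. */
static vec128 rev16_16b(vec128 a)
{
    vec128 r;
    for (int i = 0; i < 16; i += 2) {
        r.b[i]     = a.b[i + 1];
        r.b[i + 1] = a.b[i];
    }
    return r;
}

/* trn1 Vd.2d, Vn.2d, Vm.2d: low doubleword of a, then low doubleword of b. */
static vec128 trn1_2d(vec128 a, vec128 b)
{
    vec128 r;
    memcpy(r.b, a.b, 8);
    memcpy(r.b + 8, b.b, 8);
    return r;
}

/* dup.8b / lsr / dup.8b / trn1, as in the patch. */
static vec128 split_current(uint32_t w)
{
    return trn1_2d(dup_8b(w), dup_8b(w >> 8));
}

/* dup.8h / uzp2 / uzp1 / trn1. */
static vec128 split_uzp_pair(uint32_t w)
{
    vec128 v0 = dup_8h(w);
    vec128 v1 = uzp_16b(v0, v0, 1);
    v0 = uzp_16b(v0, v0, 0);
    return trn1_2d(v0, v1);
}

/* dup.8h / rev16 / uzp1. */
static vec128 split_rev16(uint32_t w)
{
    vec128 v0 = dup_8h(w);
    vec128 v1 = rev16_16b(v0);
    return uzp_16b(v0, v1, 0);
}

/* dup.8h / uzp1 with both inputs the same: only E0 survives. */
static vec128 split_naive(uint32_t w)
{
    vec128 v0 = dup_8h(w);
    return uzp_16b(v0, v0, 0);
}

/* All three working sequences should give E0 x8 followed by E1 x8. */
static void check_split(uint32_t w)
{
    vec128 a = split_current(w);
    vec128 b = split_uzp_pair(w);
    vec128 c = split_rev16(w);
    assert(memcmp(a.b, b.b, 16) == 0);
    assert(memcmp(a.b, c.b, 16) == 0);
}
```

In this model, check_split(0x2a11) passes for the three working variants, while split_naive(0x2a11) ends up with E0 in all 16 bytes.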

ok as is
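
For what it's worth, the premise of the patch can be checked exhaustively too. The sketch below assumes the comparison has the usual VP9 form 2*|p0-q0| + |p1-q1|/2 <= E and models the saturating add as a plain C helper. The function names are hypothetical and this is not the code from the patch.

```c
#include <stdint.h>

/* Model of an unsigned saturating 8-bit add (what uqadd does per byte lane). */
static uint8_t uqadd8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + (unsigned)b;
    return (uint8_t)(s > 255 ? 255 : s);
}

/* Reference: do the sum in wider arithmetic, then compare to E.
 * d0 stands for |p0 - q0|, d1 for |p1 - q1| (assumed form of the check). */
static int mask_wide(uint8_t d0, uint8_t d1, uint8_t e)
{
    return 2u * d0 + (unsigned)(d1 >> 1) <= e;
}

/* Same check, but every intermediate value stays within 8 bits. */
static int mask_sat8(uint8_t d0, uint8_t d1, uint8_t e)
{
    uint8_t sum = uqadd8(uqadd8(d0, d0), (uint8_t)(d1 >> 1));
    return sum <= e;
}

/* Count inputs where the two versions disagree, for all E in 0..e_max. */
static long count_mismatches(unsigned e_max)
{
    long bad = 0;
    for (unsigned e = 0; e <= e_max; e++)
        for (unsigned d0 = 0; d0 < 256; d0++)
            for (unsigned d1 = 0; d1 < 256; d1++)
                bad += mask_wide((uint8_t)d0, (uint8_t)d1, (uint8_t)e) !=
                       mask_sat8((uint8_t)d0, (uint8_t)d1, (uint8_t)e);
    return bad;
}
```

With E limited to 193, count_mismatches(193) finds no disagreement in this model. The two versions only diverge if E is allowed to be 255, where a saturated sum would wrongly pass the comparison.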

Janne