On 2017-02-11 23:42:05 +0200, Martin Storsjö wrote: > On Sat, 11 Feb 2017, Martin Storsjö wrote: > > >On Fri, 10 Feb 2017, Janne Grunau wrote: > > > >>On 2017-01-15 22:55:52 +0200, Martin Storsjö wrote: > >>>For this case, with 8 inputs but only changing 4 of them, we can fit > >>>all 16 input pixels into a q register, and still have enough temporary > >>>registers for doing the loop filter. > >>> > >>>The wd=8 filters would require too many temporary registers for > >>>processing all 16 pixels at once though. > >>> > >>>Before: Cortex A7 A8 A9 A53 > >>>vp9_loop_filter_mix2_v_44_16_neon: 289.7 256.2 237.5 181.2 > >>>After: > >>>vp9_loop_filter_mix2_v_44_16_neon: 221.2 150.5 177.7 138.0 > >>>--- > >>> libavcodec/arm/vp9dsp_init_arm.c | 7 +- > >>> libavcodec/arm/vp9lpf_neon.S | 191 > >+++++++++++++++++++++++++++++++++++++++ > >>> 2 files changed, 195 insertions(+), 3 deletions(-) > >>> > >>>diff --git a/libavcodec/arm/vp9dsp_init_arm.c > >b/libavcodec/arm/vp9dsp_init_arm.c > >>>index e99d931..1ede170 100644 > >>>--- a/libavcodec/arm/vp9dsp_init_arm.c > >>>+++ b/libavcodec/arm/vp9dsp_init_arm.c > >>>@@ -194,6 +194,8 @@ define_loop_filters(8, 8); > >>> define_loop_filters(16, 8); > >>> define_loop_filters(16, 16); > >>> > >>>+define_loop_filters(44, 16); > >>>+ > >>> #define lf_mix_fn(dir, wd1, wd2, stridea) > >\ > >>> static void loop_filter_##dir##_##wd1##wd2##_16_neon(uint8_t *dst, > >\ > >>> ptrdiff_t > >>>stride, > >\ > >>>@@ -207,7 +209,6 @@ static void > >loop_filter_##dir##_##wd1##wd2##_16_neon(uint8_t *dst, > >>> lf_mix_fn(h, wd1, wd2, stride) \ > >>> lf_mix_fn(v, wd1, wd2, sizeof(uint8_t)) > >>> > >>>-lf_mix_fns(4, 4) > >>> lf_mix_fns(4, 8) > >>> lf_mix_fns(8, 4) > >>> lf_mix_fns(8, 8) > >>>@@ -227,8 +228,8 @@ static av_cold void > >vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp) > >>> dsp->loop_filter_16[0] = ff_vp9_loop_filter_h_16_16_neon; > >>> dsp->loop_filter_16[1] = ff_vp9_loop_filter_v_16_16_neon; > >>> > >>>- dsp->loop_filter_mix2[0][0][0] = loop_filter_h_44_16_neon; > >>>- dsp->loop_filter_mix2[0][0][1] = loop_filter_v_44_16_neon; > >>>+ dsp->loop_filter_mix2[0][0][0] = ff_vp9_loop_filter_h_44_16_neon; > >>>+ dsp->loop_filter_mix2[0][0][1] = ff_vp9_loop_filter_v_44_16_neon; > >>> dsp->loop_filter_mix2[0][1][0] = loop_filter_h_48_16_neon; > >>> dsp->loop_filter_mix2[0][1][1] = loop_filter_v_48_16_neon; > >>> dsp->loop_filter_mix2[1][0][0] = loop_filter_h_84_16_neon; > >>>diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S > >>>index e31c807..12984a9 100644 > >>>--- a/libavcodec/arm/vp9lpf_neon.S > >>>+++ b/libavcodec/arm/vp9lpf_neon.S > >>>@@ -44,6 +44,109 @@ > >>> vtrn.8 \r2, \r3 > >>> .endm > >>> > >>>+@ The input to and output from this macro is in the registers q8-q15, > >>>+@ and q0-q7 are used as scratch registers. > >>>+@ p3 = q8, p0 = q11, q0 = q12, q3 = q15 > >>>+.macro loop_filter_q > >>>+ vdup.u8 d0, r2 @ E > >>>+ lsr r2, r2, #8 > >>>+ vdup.u8 d2, r3 @ I > >>>+ lsr r3, r3, #8 > >>>+ vdup.u8 d1, r2 @ E > >>>+ vdup.u8 d3, r3 @ I > > > >I tried implementing your suggestion with uzp here, but it ended up being > >slower actually. With the version of the patch I posted here: > > > >vp9_loop_filter_mix2_v_44_16_neon: 221.2 150.5 185.0 139.0 > > > >With this block replaced with this: > > > > vdup.u16 q0, r2 @ E > > vdup.u16 q1, r3 @ I > > vuzp.u8 d0, d1 @ E > > vuzp.u8 d2, d3 @ I > > > >I get the following: > > > >vp9_loop_filter_mix2_v_44_16_neon: 223.2 150.5 186.1 142.0 > > > >I.e. 1-3 cycles slower on A7, A9 and A53, identical on A8. > > If I move the two vuzp further down, I get the following: > > vp9_loop_filter_mix2_v_44_16_neon: 223.2 148.5 185.1 141.0 > > I.e. +2 on A7, -2 on A8, 0 on A9, +2 on A53. So on average it's still worse, > even though it codewise is neater.
leave it as it was then Janne _______________________________________________ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel