On Tue, 11 Oct 2016, Martin Storsjö wrote:

On Mon, 10 Oct 2016, Luca Barbato wrote:

On 10/10/2016 22:02, Martin Storsjö wrote:
Would the benchmark numbers be more readable/usable if presented as
relative speedup vs the C version? That'd cut the number of rows
in half, and make it easier to spot outliers or areas needing
more tuning.

The relative speedups for the functions above (plus the _hv variants)
are:
                                   A7     A8     A9    A53
vp9_avg4_neon:                   1.66   0.69   1.56   1.77

Yes, looks much better.

While at it, I wonder whether some external factor made the A8 benchmark
show a speed regression.

No, I don't think so, I think it's just peculiarities of the A8 - it's quite reproducible. Hopefully Janne has got some tips on how to make it better.

Otherwise it's probably a tradeoff between whether one wants to keep it (since it gives a pretty decent speedup on the others) or just skip it. Or complicate things even further by having some condition for detecting the A8 (and similar cores?) and skipping it on them... Given that A8 is pretty rare these days (especially where anybody would want to decode VP9) I wouldn't make too much of a fuss out of it though.

I tried unrolling it another round, to process 4 lines at a time - then I get the following numbers:

                      A7    A8    A9   A53
vp9_avg4_8bpp_c:     62.4  45.2  46.2  47.7
vp9_avg4_8bpp_neon:  33.8  48.0  34.5  32.5

The prior version had these results:

vp9_avg4_c:          59.7  47.2  46.2  48.2
vp9_avg4_neon:       36.0  68.2  29.7  27.2

So it's a small gain on A7, large gain on A8 (but still marginally slower than the C code), and marginally slower on A9 and A53 (but still better than the C code). So perhaps that's a good tradeoff?
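To make that tradeoff easier to judge, here is a minimal C sketch that turns the two sets of timings above into relative speedups (C time divided by NEON time, so values above 1.0 mean the NEON version wins). The function and array names are made up for illustration; only the numbers come from the tables above.

```c
#include <assert.h>
#include <stddef.h>

/* Timings copied from the tables above, in column order A7, A8, A9, A53. */
static const double unrolled_c[4]    = { 62.4, 45.2, 46.2, 47.7 };
static const double unrolled_neon[4] = { 33.8, 48.0, 34.5, 32.5 };
static const double prior_c[4]       = { 59.7, 47.2, 46.2, 48.2 };
static const double prior_neon[4]    = { 36.0, 68.2, 29.7, 27.2 };

/* Hypothetical helper: out[i] = c[i] / neon[i] for each core.
 * A result above 1.0 means the NEON version is faster than the C one. */
static void relative_speedup(const double *c, const double *neon,
                             double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(neon[i] > 0.0);   /* guard against division by zero */
        out[i] = c[i] / neon[i];
    }
}
```

Calling relative_speedup() on the unrolled numbers gives roughly 1.85, 0.94, 1.34, 1.47, against 1.66, 0.69, 1.56, 1.77 for the prior version.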

That's with a version that looks like this:

function ff_vp9_avg4_neon, export=1
       ldr             r12, [sp]
1:
       vld1.32         {d4[0]},  [r2], r3
       vld1.32         {d0[0]},  [r0], r1
       vld1.32         {d5[0]},  [r2], r3
       vrhadd.u8       d0,  d0,  d4
       vld1.32         {d1[0]},  [r0], r1
       vld1.32         {d6[0]},  [r2], r3
       vrhadd.u8       d1,  d1,  d5
       vld1.32         {d2[0]},  [r0], r1
       vld1.32         {d7[0]},  [r2], r3
       vrhadd.u8       d2,  d2,  d6
       vld1.32         {d3[0]},  [r0], r1
       sub             r0,  r0,  r1, lsl #2
       subs            r12, r12, #4
       vst1.32         {d0[0]},  [r0], r1
       vrhadd.u8       d3,  d3,  d7
       vst1.32         {d1[0]},  [r0], r1
       vst1.32         {d2[0]},  [r0], r1
       vst1.32         {d3[0]},  [r0], r1
       bne             1b
       bx              lr
endfunc
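For anyone reading along without the C code at hand, this is a scalar sketch of what the loop above computes, not the actual libav reference implementation. The names rhadd_u8 and avg4_ref are hypothetical; the parameter order is inferred from the register use in the assembly (r0 = dst, r1 = dst stride, r2 = src, r3 = src stride, [sp] = height).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounding halving add on one byte, i.e. what vrhadd.u8 does per lane:
 * (a + b + 1) >> 1. The operands are promoted to int, so the sum
 * cannot overflow. */
static uint8_t rhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical scalar reference: average a 4-pixel-wide block of src
 * into dst, one row at a time. */
static void avg4_ref(uint8_t *dst, ptrdiff_t dst_stride,
                     const uint8_t *src, ptrdiff_t src_stride, int h)
{
    assert(h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 4; x++)
            dst[x] = rhadd_u8(dst[x], src[x]);
        dst += dst_stride;
        src += src_stride;
    }
}
```

The assembly handles four rows per iteration and so relies on the height being a multiple of 4; the scalar sketch has no such restriction.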

I forgot one other trick that helps a little on the A8; changing

        vld1.32         {d4[0]},  [r2], r3
into
        vld1.32         {d4[]},   [r2], r3
(when it doesn't matter what we load into the second half of the registers).

That gets the runtime down to this:

                      A7    A8    A9   A53
vp9_avg4_8bpp_c:     61.4  45.2  46.2  47.7
vp9_avg4_8bpp_neon:  33.8  44.0  37.5  32.5

So then it's finally faster on all of them.
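To illustrate why the {d4[0]} to {d4[]} change is safe, here is a toy C model of the two load forms. It is only a sketch: the dreg type and the helper names are invented for this example, and it models register contents, not timing. One plausible reason for the A8 gain (an assumption, not something measured here) is that the single-lane form has to merge into the old register value, while the all-lanes form overwrites it completely.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a 64-bit NEON d register as two 32-bit lanes. */
typedef struct { uint32_t lane[2]; } dreg;

/* Models vld1.32 {dN[0]}, [ptr]: lane 0 is replaced, lane 1 keeps its
 * old value, so the result depends on the previous register contents. */
static dreg load_lane0(dreg old, const uint32_t *ptr)
{
    old.lane[0] = *ptr;
    return old;
}

/* Models vld1.32 {dN[]}, [ptr]: the loaded word is copied to both
 * lanes, so the previous register contents are not needed. */
static dreg load_dup(const uint32_t *ptr)
{
    dreg r = { { *ptr, *ptr } };
    return r;
}

/* Models vst1.32 {dN[0]}, [ptr]: only lane 0 reaches memory. */
static uint32_t store_lane0(dreg r)
{
    return r.lane[0];
}

/* Self-check: both load forms store the same word, whatever was in the
 * register before. */
static void check_lane0_equivalence(dreg old, const uint32_t *ptr)
{
    assert(store_lane0(load_lane0(old, ptr)) == store_lane0(load_dup(ptr)));
}
```

Since vrhadd.u8 works on each byte independently and only lane 0 is ever stored, whatever ends up in the upper lane never reaches memory.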

// Martin