On Tue, 11 Oct 2016, Martin Storsjö wrote:
On Mon, 10 Oct 2016, Luca Barbato wrote:
On 10/10/2016 22:02, Martin Storsjö wrote:
Would the benchmark numbers be more readable/usable if presented as
relative speedup vs the C version? That'd cut the number of rows
in half, and make it easier to spot outliers or areas needing
more tuning.
The relative speedups for the functions above (plus the _hv variants)
are:
vp9_avg4_neon: 1.66 0.69 1.56 1.77
Yes, looks much better.
While at it, I wonder if some external factor made the A8 benchmark show
a speed regression.
No, I don't think so, I think it's just peculiarities of the A8 - it's quite
reproducible. Hopefully Janne has got some tips on how to make it better.
Otherwise it's probably a tradeoff between whether one wants to keep it
(since it gives a pretty decent speedup on the others) or just skip it. Or
complicate things even further by having some condition for detecting the A8
(and similar cores?) and skipping it on them... Given that A8 is pretty rare
these days (especially where anybody would want to decode VP9) I wouldn't
make too much of a fuss out of it though.
I tried unrolling it another round, to process 4 lines at a time - then I get
the following numbers:
                        A7     A8     A9    A53
vp9_avg4_8bpp_c:      62.4   45.2   46.2   47.7
vp9_avg4_8bpp_neon:   33.8   48.0   34.5   32.5
The prior version had these results:
vp9_avg4_c:           59.7   47.2   46.2   48.2
vp9_avg4_neon:        36.0   68.2   29.7   27.2
So it's a small gain on A7, large gain on A8 (but still marginally slower
than the C code), and marginally slower on A9 and A53 (but still better than
the C code). So perhaps that's a good tradeoff?
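
To make that tradeoff easier to judge, the two tables can be turned into
relative speedups (C time divided by NEON time, so anything above 1.0 is a
win). The following is only a small sketch; the helper names are made up for
illustration, and the numbers are copied from the tables above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Relative speedup of an optimized routine over the C reference:
 * values above 1.0 mean the NEON version is faster.
 * (Hypothetical helper, not part of any existing tool.) */
static double relative_speedup(double c_time, double neon_time)
{
    assert(neon_time > 0.0);
    return c_time / neon_time;
}

/* Print one row of speedups for the four cores (A7, A8, A9, A53). */
static void print_speedups(const char *label,
                           const double c[4], const double neon[4])
{
    printf("%-22s", label);
    for (size_t i = 0; i < 4; i++)
        printf(" %5.2f", relative_speedup(c[i], neon[i]));
    printf("\n");
}

/* Benchmark numbers copied from the two tables above. */
static const double prior_c[4]    = { 59.7, 47.2, 46.2, 48.2 };
static const double prior_neon[4] = { 36.0, 68.2, 29.7, 27.2 };
static const double new_c[4]      = { 62.4, 45.2, 46.2, 47.7 };
static const double new_neon[4]   = { 33.8, 48.0, 34.5, 32.5 };
```

Calling print_speedups() on the prior numbers gives 1.66 0.69 1.56 1.77,
and on the 4-line version 1.85 0.94 1.34 1.47, so the A8 goes from a clear
regression to nearly break-even.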
That's with a version that looks like this:
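function ff_vp9_avg4_neon, export=1
        ldr             r12, [sp]               @ r12 = row count, read from the stack
1:
        vld1.32         {d4[0]}, [r2], r3       @ load 4 bytes from [r2], advance by r3
        vld1.32         {d0[0]}, [r0], r1       @ load 4 bytes from [r0], advance by r1
        vld1.32         {d5[0]}, [r2], r3
        vrhadd.u8       d0, d0, d4              @ rounding average: (a + b + 1) >> 1
        vld1.32         {d1[0]}, [r0], r1
        vld1.32         {d6[0]}, [r2], r3
        vrhadd.u8       d1, d1, d5
        vld1.32         {d2[0]}, [r0], r1
        vld1.32         {d7[0]}, [r2], r3
        vrhadd.u8       d2, d2, d6
        vld1.32         {d3[0]}, [r0], r1
        sub             r0, r0, r1, lsl #2      @ rewind r0 by the 4 rows just read
        subs            r12, r12, #4            @ 4 rows handled per iteration
        vst1.32         {d0[0]}, [r0], r1       @ store the averaged rows back via r0
        vrhadd.u8       d3, d3, d7
        vst1.32         {d1[0]}, [r0], r1
        vst1.32         {d2[0]}, [r0], r1
        vst1.32         {d3[0]}, [r0], r1
        bne             1b                      @ flags still come from the subs above
        bx              lr
endfunc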
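
For reference, here is a scalar C sketch of what the loop above computes.
It is only a model for checking one's reading of the assembly: the function
name, the argument names and their order are assumptions inferred from how
r0-r3 and the stack value are used, not taken from any header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounding average of two bytes, the per-lane operation of vrhadd.u8. */
static uint8_t rounding_avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical scalar model: average a 4-pixel-wide block of h rows
 * from src into dst, in place. */
static void avg4_model(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride, int h)
{
    /* The assembly decrements the counter by 4 per iteration, so it
     * relies on h being a positive multiple of 4. */
    assert(h > 0 && h % 4 == 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 4; x++)
            dst[x] = rounding_avg_u8(dst[x], src[x]);
        dst += dst_stride;
        src += src_stride;
    }
}
```

The assembly reads all four dst rows before storing any of them, which is
why it has to rewind r0; the scalar model can simply update each row in
place.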
I forgot one other trick that helps a little on the A8; changing
        vld1.32 {d4[0]}, [r2], r3
into
        vld1.32 {d4[]}, [r2], r3
(when it doesn't matter what we load into the second half of the
registers).
That gets the runtime down to this:
vp9_avg4_8bpp_c:      61.4   45.2   46.2   47.7
vp9_avg4_8bpp_neon:   33.8   44.0   37.5   32.5
So then it's finally faster on all of them.
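
The difference between the two load forms can be modelled in C as below.
This is just a sketch of the register semantics (the type and function names
are invented for illustration). My guess, and it is only a guess, is that
the all-lanes form is cheaper on the A8 because it doesn't have to merge the
loaded word with the old register contents.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 64-bit NEON d register viewed as two 32-bit lanes (illustrative type). */
typedef struct { uint32_t lane[2]; } dreg;

/* Model of vld1.32 {dN[0]}, [ptr]: lane 0 is loaded, lane 1 keeps its
 * old value, so the result depends on the previous register contents. */
static dreg load_lane0(dreg old, const uint8_t *ptr)
{
    memcpy(&old.lane[0], ptr, 4);
    return old;
}

/* Model of vld1.32 {dN[]}, [ptr]: the loaded word is replicated to both
 * lanes, so nothing of the previous register contents survives. */
static dreg load_dup(const uint8_t *ptr)
{
    dreg r;
    memcpy(&r.lane[0], ptr, 4);
    r.lane[1] = r.lane[0];
    return r;
}
```

Since the function only ever stores lane 0 of each register, both forms
give the same output; only the unused upper half differs.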
// Martin