On Sun, 5 Feb 2017, Martin Storsjö wrote:
On Sun, 5 Feb 2017, Janne Grunau wrote:
On 2016-12-01 11:27:02 +0200, Martin Storsjö wrote:
This work is sponsored by, and copyright, Google.
This makes it easier to avoid filling the temp buffer with zeros for the
skipped slices, and leads to slightly more straightforward code for these
cases (for the 16x16 case, the special case pass functions are written
out instead of being templated from the same macro), instead of riddling
the common code with special case branches or macro .ifs.
The code size increases from 14740 bytes to 24472 bytes.
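As a hedged illustration of the approach (plain C with hypothetical names; the actual functions are NEON assembly generated from macros, and the eob thresholds here are only illustrative), the idea is to dispatch on eob to a pass function specialized for how many slices are nonzero, so the skipped slices are never computed:

```c
// Hypothetical sketch of per-subset pass functions for a 16x16 inverse DCT.
// None of these names match the real libav symbols.
typedef void (*idct16_pass_fn)(short *dst, const short *src);

static void idct16_pass1_quarter(short *dst, const short *src)
{ (void)dst; (void)src; /* process only slices 0-3 */ }
static void idct16_pass1_half(short *dst, const short *src)
{ (void)dst; (void)src; /* process only slices 0-7 */ }
static void idct16_pass1_full(short *dst, const short *src)
{ (void)dst; (void)src; /* process all 16 slices */ }

// Pick the cheapest specialized pass that covers all nonzero coefficients,
// instead of branching inside one shared pass. Threshold values here are
// illustrative, not taken from the patch.
static idct16_pass_fn select_pass1(int eob)
{
    if (eob <= 10)  // nonzero coefficients confined to a 4x4 corner
        return idct16_pass1_quarter;
    if (eob <= 38)  // confined to an 8x8 corner
        return idct16_pass1_half;
    return idct16_pass1_full;
}
```

Writing the specialized passes out (or templating them per subset) is what makes the code larger but branch-free.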
Before:
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1390.3
vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 556.5
vp9_inv_dct_dct_32x32_sub2_add_neon: 5199.1
vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9
vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.9
vp9_inv_dct_dct_32x32_sub12_add_neon: 6171.6
vp9_inv_dct_dct_32x32_sub16_add_neon: 6170.9
vp9_inv_dct_dct_32x32_sub20_add_neon: 7147.1
vp9_inv_dct_dct_32x32_sub24_add_neon: 7147.0
vp9_inv_dct_dct_32x32_sub28_add_neon: 8118.8
vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_16x16_sub2_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 845.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1389.4
vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 556.5
vp9_inv_dct_dct_32x32_sub2_add_neon: 3684.1
vp9_inv_dct_dct_32x32_sub4_add_neon: 3682.6
vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.1
vp9_inv_dct_dct_32x32_sub12_add_neon: 5319.0
vp9_inv_dct_dct_32x32_sub16_add_neon: 5323.5
vp9_inv_dct_dct_32x32_sub20_add_neon: 7149.8
vp9_inv_dct_dct_32x32_sub24_add_neon: 7148.2
vp9_inv_dct_dct_32x32_sub28_add_neon: 8124.5
vp9_inv_dct_dct_32x32_sub32_add_neon: 8122.1
---
If we hadn't made the core transforms standalone functions,
the code size would end up at around 34 KB.
The binary output is 6 KB larger than in the other alternative,
but is more straightforward and gives better opportunities for
further special-casing.
In the arm version, skipping the zeroing of the temp buffer gave a
significant speedup compared to the other alternative (having cmps
within the functions). Here there's much less difference.
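The temp buffer point can be sketched as follows (a minimal C sketch with hypothetical names, standing in for the NEON code): with a shared second pass, a partial first pass has to zero the slices it skips so the second pass can read the whole buffer, while a second pass specialized per slice count reads only what was actually written:

```c
#include <string.h>

#define N 16

// Placeholder for one row of the actual transform (NEON in the real code).
static void compute_row(short *row)
{
    for (int j = 0; j < N; j++)
        row[j] = 1;
}

// Alternative with a shared second pass: the first pass must zero the
// skipped rows so the shared second pass can safely read all of tmp.
static void pass1_shared(short tmp[N][N], int nonzero_rows)
{
    for (int i = 0; i < nonzero_rows; i++)
        compute_row(tmp[i]);
    // This is the cost the specialized variant avoids:
    memset(tmp[nonzero_rows], 0, (N - nonzero_rows) * N * sizeof(short));
}

// Alternative with specialized second passes: a pass templated for
// 'nonzero_rows' reads only tmp[0..nonzero_rows-1], so nothing is zeroed.
static void pass1_specialized(short tmp[N][N], int nonzero_rows)
{
    for (int i = 0; i < nonzero_rows; i++)
        compute_row(tmp[i]);
}
```

The saving is largest for the heavily pruned sub2/sub4/sub8 cases, where most of the buffer would otherwise be zero-filled.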
And the relative binary size difference is even larger. It would be a
little strange to choose different alternatives for 32- and 64-bit, but
it sounds like alternative 1 might be better for arm64. Please run a
full decoding benchmark for arm64 too.
Yeah, I need to do more extensive full benchmarks to know whether it's
worth it or not. The difference in the arm case seemed bigger than it
should be based on checkasm numbers as well, so perhaps I need to run a
few more iterations to get more accurate values.
Ok, so after running a slightly shorter clip (which seems to spend about
as large a percentage of its runtime doing IDCT as the previous one) with
a few more iterations, I've got the following results (the 'user' part
from 'time avconv -threads 1 -i foo -f null -'):
32 orig 32 alt1 32 alt2 64 orig 64 alt1 64 alt2
40.436s 40.148s 40.008s 37.428s 37.356s 37.192s
40.596s 40.140s 40.216s 37.572s 37.524s 37.384s
40.512s 40.228s 40.188s 37.740s 37.588s 37.368s
40.584s 40.136s 40.216s 37.880s 37.492s 37.348s
40.572s 40.292s 40.232s 37.756s 37.556s 37.676s
40.764s 40.312s 40.232s 37.876s 37.640s 37.468s
40.688s 40.284s 40.368s 37.972s 37.608s 37.460s
So while alt2 is faster in most runs, the margin is not quite as big as in
the previous benchmark. (The benchmarks were done on a practically
unloaded system so it shouldn't vary too much from run to run, but in
practice, the first few runs seem to be slightly faster than the later
ones.)
I.e. around a 400 ms gain out of 40 s for alt1, and then anywhere from a
50 ms loss to a 150 ms further gain on top of that for alt2.
What do you think?
// Martin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel