On Sun, 5 Feb 2017, Janne Grunau wrote:
On 2016-12-01 11:27:02 +0200, Martin Storsjö wrote:
This work is sponsored by, and copyright, Google.
This makes it easier to avoid filling the temp buffer with zeros for the
skipped slices, and leads to slightly more straightforward code for these
cases (for the 16x16 case, where the special case pass functions are
written out instead of templated from the same macro), instead of riddling
the common code with special case branches or macro .ifs.
The code size increases from 14740 bytes to 24472 bytes.
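
As a rough illustration of the approach described above, the entry point can pick a specialised variant once, based on the end-of-block count, instead of branching inside the shared transform code. This is only a sketch in C rather than NEON assembly. The enum names, the helper functions, and the threshold parameters are hypothetical and not taken from the patch; only the eob == 1 DC shortcut mirrors the `cmp w3, #1` check quoted later in this thread, assuming w3 holds eob.

```c
#include <stddef.h>

/* Hypothetical variant identifiers; the names are illustrative only. */
enum idct_variant { IDCT_DC_ONLY, IDCT_QUARTER, IDCT_HALF, IDCT_FULL };

/*
 * Pick a variant once at function entry. quarter_max and half_max are
 * assumed eob thresholds supplied by the caller; the real values depend
 * on the scan order and are not stated in this message.
 */
enum idct_variant pick_variant(int eob, int quarter_max, int half_max)
{
    if (eob == 1)
        return IDCT_DC_ONLY;
    if (eob <= quarter_max)
        return IDCT_QUARTER;
    if (eob <= half_max)
        return IDCT_HALF;
    return IDCT_FULL;
}

/*
 * Number of input slices (out of n) that the first pass needs to read
 * for a given variant. The remaining slices are known to be zero, so a
 * specialised pass never has to zero-fill the temp buffer for them.
 */
int first_pass_slices(enum idct_variant v, int n)
{
    switch (v) {
    case IDCT_DC_ONLY: return 1;
    case IDCT_QUARTER: return n / 4;
    case IDCT_HALF:    return n / 2;
    default:           return n;
    }
}
```

With placeholder thresholds, `first_pass_slices(pick_variant(eob, 10, 38), 16)` gives the number of slices a 16x16 first pass would touch. The trade-off is the one described in this message: each variant is a separate copy of the pass code, which is where the code size increase comes from.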
Before:
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1390.3
vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 556.5
vp9_inv_dct_dct_32x32_sub2_add_neon: 5199.1
vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9
vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.9
vp9_inv_dct_dct_32x32_sub12_add_neon: 6171.6
vp9_inv_dct_dct_32x32_sub16_add_neon: 6170.9
vp9_inv_dct_dct_32x32_sub20_add_neon: 7147.1
vp9_inv_dct_dct_32x32_sub24_add_neon: 7147.0
vp9_inv_dct_dct_32x32_sub28_add_neon: 8118.8
vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_16x16_sub2_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 845.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1389.4
vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 556.5
vp9_inv_dct_dct_32x32_sub2_add_neon: 3684.1
vp9_inv_dct_dct_32x32_sub4_add_neon: 3682.6
vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.1
vp9_inv_dct_dct_32x32_sub12_add_neon: 5319.0
vp9_inv_dct_dct_32x32_sub16_add_neon: 5323.5
vp9_inv_dct_dct_32x32_sub20_add_neon: 7149.8
vp9_inv_dct_dct_32x32_sub24_add_neon: 7148.2
vp9_inv_dct_dct_32x32_sub28_add_neon: 8124.5
vp9_inv_dct_dct_32x32_sub32_add_neon: 8122.1
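
To make the before/after numbers easier to compare, the ratios can be computed with a small helper. This is a sketch only; the function names are made up for illustration, and the inputs are the figures listed above.

```c
#include <stdio.h>

/* Ratio of the old timing to the new one; values above 1.0 mean faster. */
double speedup(double before, double after)
{
    return before / after;
}

/* Relative code size growth in percent. */
double size_growth_percent(long before, long after)
{
    return 100.0 * (double)(after - before) / (double)before;
}

/* Print the ratios for a few of the cases listed above. */
void print_summary(void)
{
    printf("16x16 sub4: %.2fx\n", speedup(1051.0, 639.0));
    printf("16x16 sub8: %.2fx\n", speedup(1051.0, 845.0));
    printf("32x32 sub4: %.2fx\n", speedup(5199.9, 3682.6));
    printf("32x32 sub16: %.2fx\n", speedup(6170.9, 5323.5));
    printf("code size: +%.1f%%\n", size_growth_percent(14740, 24472));
}
```

For the 16x16 sub4 case this works out to roughly 1.64x, and for 32x32 sub4 to roughly 1.41x, against a code size increase of about 66%. The full-size cases (sub12/sub16 for 16x16, sub20 and up for 32x32) are essentially unchanged.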
---
If we hadn't made the core transforms standalone functions,
the code size would end up at around 34 KB.
The binary output is 6 KB larger than in the other alternative,
but is more straightforward and gives better opportunities to
special case them further.
In the arm version, there was a significant speedup compared to the
other alternative (having cmps within the functions), skipping
zeroing of the temp buffer. Here there's much less difference.
And the relative binary size difference is even larger. It would be a
little strange to choose different alternatives for 32- and 64-bit but
it sounds like alternative 1 might be better for arm64. Please run a
full decoding benchmark for arm64 too.
Yeah, I need to do more extensive full benchmarks to know whether it's
worth it or not. The difference in the arm case seemed bigger than it
should be based on checkasm numbers as well, so perhaps I need to run a
few more iterations to get more accurate values.
+endfunc
these two should be templated
Will do
function ff_vp9_idct_idct_32x32_add_neon, export=1
cmp w3, #1
b.eq idct32x32_dc_add_neon
saving d8-d15 should be done here, saves duplicating it in the
quarter/half variants. same for the idct_coeffs and other stuff.
Ok, will try
// Martin