[libav-devel] [PATCH 5/5] aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 (alternative 2)

Martin Storsjö Sun, 05 Feb 2017 04:11:03 -0800

On Sun, 5 Feb 2017, Janne Grunau wrote:

On 2016-12-01 11:27:02 +0200, Martin Storsjö wrote:

This work is sponsored by, and copyright, Google.


This makes it easier to avoid filling the temp buffer with zeros for the
skipped slices, and leads to slightly more straightforward code for these
cases (for the 16x16 case, where the special case pass functions are
written out instead of templated from the same macro), instead of riddling
the common code with special case branches or macro .ifs.
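As a rough C sketch of the idea (not the actual NEON code): the entry point picks the DC, quarter, half or full variant once, based on eob, so the pass functions themselves carry no special-case branches. The cutoff values and all names below are hypothetical placeholders, not taken from the patch.

```c
#include <assert.h>

/* Hypothetical cutoffs for illustration only; the real values depend on
 * the coefficient scan order and are not taken from this patch. */
enum { QUARTER_MAX_EOB = 10, HALF_MAX_EOB = 38 };

typedef enum { IDCT_DC, IDCT_QUARTER, IDCT_HALF, IDCT_FULL } idct_variant;

/* Decide once, at the entry point, which standalone variant to run,
 * so the pass functions themselves need no special-case branches. */
static idct_variant pick_idct16_variant(int eob)
{
    assert(eob >= 1);            /* at least the DC coefficient is present */
    if (eob == 1)
        return IDCT_DC;          /* DC-only shortcut */
    if (eob <= QUARTER_MAX_EOB)
        return IDCT_QUARTER;     /* few coefficients: quarter variant */
    if (eob <= HALF_MAX_EOB)
        return IDCT_HALF;        /* moderate count: half variant */
    return IDCT_FULL;            /* everything else: full transform */
}
```

With these placeholder cutoffs, an eob of 10 would select the quarter variant and an eob of 256 the full one.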

The code size increases from 14740 bytes to 24472 bytes.

Before:
vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1390.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     556.5
vp9_inv_dct_dct_32x32_sub2_add_neon:    5199.1
vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.9
vp9_inv_dct_dct_32x32_sub12_add_neon:   6171.6
vp9_inv_dct_dct_32x32_sub16_add_neon:   6170.9
vp9_inv_dct_dct_32x32_sub20_add_neon:   7147.1
vp9_inv_dct_dct_32x32_sub24_add_neon:   7147.0
vp9_inv_dct_dct_32x32_sub28_add_neon:   8118.8
vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:     639.0
vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
vp9_inv_dct_dct_16x16_sub8_add_neon:     845.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1389.4
vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon:     556.5
vp9_inv_dct_dct_32x32_sub2_add_neon:    3684.1
vp9_inv_dct_dct_32x32_sub4_add_neon:    3682.6
vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.1
vp9_inv_dct_dct_32x32_sub12_add_neon:   5319.0
vp9_inv_dct_dct_32x32_sub16_add_neon:   5323.5
vp9_inv_dct_dct_32x32_sub20_add_neon:   7149.8
vp9_inv_dct_dct_32x32_sub24_add_neon:   7148.2
vp9_inv_dct_dct_32x32_sub28_add_neon:   8124.5
vp9_inv_dct_dct_32x32_sub32_add_neon:   8122.1
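
To put the numbers above in relative terms, a minimal helper (the name speedup is made up for this illustration) that divides a "Before" value by the matching "After" value:

```c
#include <assert.h>

/* Ratio of the old to the new checkasm figure; a result above 1.0
 * means the new code is faster. */
static double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}
```

For instance, speedup(1051.0, 639.0) for 16x16 sub4 comes out at roughly 1.64, and speedup(5199.1, 3684.1) for 32x32 sub2 at roughly 1.41, while the sub20 and larger cases stay at about 1.0.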

---
If we hadn't made the core transforms standalone functions,
the code size would end up at around 34 KB.

The binary output is 6 KB larger than in the other alternative,
but is more straightforward and gives better opportunities to
special case them further.

In the arm version, there was a significant speedup compared to the
other alternative (having cmps within the functions), skipping
zeroing of the temp buffer. Here there's much less difference.

And the relative binary size difference is even larger. It would be a
little strange to choose different alternatives for 32- and 64-bit but
it sounds like alternative 1 might be better for arm64. Please run a
full decoding benchmark for arm64 too.

Yeah, I need to do more extensive full benchmarks to know whether it's
worth it or not. The difference in the arm case seemed bigger than it
should be based on checkasm numbers as well, so perhaps I need to run a
few more iterations to get more correct values.

+endfunc


these two should be templated


Will do

 function ff_vp9_idct_idct_32x32_add_neon, export=1
         cmp             w3,  #1
         b.eq            idct32x32_dc_add_neon

saving d8-d15 should be done here, saves duplicating it in the
quarter/half variants. same for the idct_coeffs and other stuff.


Ok, will try

// Martin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel

Re: [libav-devel] [PATCH 5/5] aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 (alternative 2)