On Sun, 5 Feb 2017, Martin Storsjö wrote:

On Sun, 5 Feb 2017, Janne Grunau wrote:

On 2016-12-01 11:27:02 +0200, Martin Storsjö wrote:
This work is sponsored by, and copyright, Google.

This makes it easier to avoid filling the temp buffer with zeros for the
skipped slices, and leads to slightly more straightforward code for these
cases (for the 16x16 case, where the special case pass functions are
written out instead of templated from the same macro), instead of riddling
the common code with special case branches or macro .ifs.
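For illustration, the structural difference can be sketched in C (a toy stand-in transform with hypothetical names, not the actual NEON code): the generic path zero-fills the temp buffer and runs a full second pass over it, while the special-cased path pairs a cut-down second pass with the partially written buffer, so no zeroing is needed.

```c
#include <assert.h>
#include <string.h>

#define N 8 /* toy 8x8 transform; the patch does this for 16x16 and 32x32 */

/* Toy row transform standing in for one idct pass; any linear
 * function of the row works for the illustration. */
static void row_pass(const int *in, int *out)
{
    for (int j = 0; j < N; j++)
        out[j] = in[j] * (j + 1) + (j > 0 ? out[j - 1] : 0);
}

/* Second pass: column sums over 'rows' valid rows of tmp. The
 * special-cased variants in the patch correspond to rows = 4, 8, ...;
 * rows = N is the full pass. */
static void col_pass(const int *tmp, int *out, int rows)
{
    for (int j = 0; j < N; j++) {
        int acc = 0;
        for (int i = 0; i < rows; i++)
            acc += tmp[i * N + j];
        out[j] = acc;
    }
}

/* Generic path: zero the temp buffer, run the first pass only on the
 * nonzero rows, then a full second pass that also reads the zeros. */
static void transform_generic(const int *in, int *out, int nonzero_rows)
{
    int tmp[N * N];
    memset(tmp, 0, sizeof(tmp));
    for (int i = 0; i < nonzero_rows; i++)
        row_pass(in + i * N, tmp + i * N);
    col_pass(tmp, out, N);
}

/* Special-cased path, as in the patch: no zeroing; the cut-down second
 * pass only reads the rows the first pass actually wrote. */
static void transform_special(const int *in, int *out, int nonzero_rows)
{
    int tmp[N * N]; /* left uninitialized past the written rows */
    for (int i = 0; i < nonzero_rows; i++)
        row_pass(in + i * N, tmp + i * N);
    col_pass(tmp, out, nonzero_rows);
}
```

Both paths produce identical output whenever the input really is zero past the given row count; the special path just skips the memset and the wasted arithmetic on known zeros.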

The code size increases from 14740 bytes to 24472 bytes.

Before:
vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1390.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     556.5
vp9_inv_dct_dct_32x32_sub2_add_neon:    5199.1
vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.9
vp9_inv_dct_dct_32x32_sub12_add_neon:   6171.6
vp9_inv_dct_dct_32x32_sub16_add_neon:   6170.9
vp9_inv_dct_dct_32x32_sub20_add_neon:   7147.1
vp9_inv_dct_dct_32x32_sub24_add_neon:   7147.0
vp9_inv_dct_dct_32x32_sub28_add_neon:   8118.8
vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:     639.0
vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
vp9_inv_dct_dct_16x16_sub8_add_neon:     845.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1389.4
vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon:     556.5
vp9_inv_dct_dct_32x32_sub2_add_neon:    3684.1
vp9_inv_dct_dct_32x32_sub4_add_neon:    3682.6
vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.1
vp9_inv_dct_dct_32x32_sub12_add_neon:   5319.0
vp9_inv_dct_dct_32x32_sub16_add_neon:   5323.5
vp9_inv_dct_dct_32x32_sub20_add_neon:   7149.8
vp9_inv_dct_dct_32x32_sub24_add_neon:   7148.2
vp9_inv_dct_dct_32x32_sub28_add_neon:   8124.5
vp9_inv_dct_dct_32x32_sub32_add_neon:   8122.1

---
If we hadn't made the core transforms standalone functions,
the code size would have ended up at around 34 KB.

The binary output is 6 KB larger than with the other alternative,
but the code is more straightforward and gives better opportunities
for further special-casing.

In the arm version, skipping the zeroing of the temp buffer gave a
significant speedup compared to the other alternative (having cmps
within the functions). Here the difference is much smaller.

And the relative binary size difference is even larger. It would be a little strange to choose different alternatives for 32- and 64-bit, but it sounds like alternative 1 might be better for arm64. Please run a full decoding benchmark for arm64 too.

Yeah, I need to do more extensive full benchmarks to know whether it's worth it or not. The difference in the arm case also seemed bigger than the checkasm numbers would suggest, so perhaps I need to run a few more iterations to get more reliable values.

Ok, so after running a slightly shorter clip (which seems to have about as large percentage of runtime doing IDCT as the previous one) with a bit more iterations, I've got the following results (the 'user' part from 'time avconv -threads 1 -i foo -f null -'):

32 orig   32 alt1   32 alt2   64 orig   64 alt1   64 alt2
40.436s   40.148s   40.008s   37.428s   37.356s   37.192s
40.596s   40.140s   40.216s   37.572s   37.524s   37.384s
40.512s   40.228s   40.188s   37.740s   37.588s   37.368s
40.584s   40.136s   40.216s   37.880s   37.492s   37.348s
40.572s   40.292s   40.232s   37.756s   37.556s   37.676s
40.764s   40.312s   40.232s   37.876s   37.640s   37.468s
40.688s   40.284s   40.368s   37.972s   37.608s   37.460s

So while alt2 is faster in most runs, the margin is not quite as big as in the previous benchmark. (The benchmarks were done on a practically unloaded system so it shouldn't vary too much from run to run, but in practice, the first few runs seem to be slightly faster than the later ones.)

I.e. around a 400 ms gain out of 40 s for alt1, and then anywhere from a 50 ms regression to a 150 ms further speedup on top of that for alt2.

What do you think?

// Martin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel