On 2016-11-28 11:26:01 +0200, Martin Storsjö wrote:
> This work is sponsored by, and copyright, Google.
> 
> Previously all subpartitions except the eob=1 (DC) case ran with
> the same runtime:
> 
> vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
> vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3
> 
> By skipping individual 4x16 or 4x32 pixel slices in the first pass,
> we reduce the runtime of these functions like this:
> 
> vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
> vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
> vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
> vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
> vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
> vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
> vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
> vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
> vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
> vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
> vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
> vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
> vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
> vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
> vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
> vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1
> 
> I.e. in general a very minor overhead for the full subpartition case due
> to the additional loads and cmps, but a significant speedup for the cases
> when we only need to process a small part of the actual input data.
> 
> In a few inspected clips of common VP9 content, 70-90% of the
> non-DC-only 16x16 and 32x32 IDCTs have nonzero coefficients only in
> the upper left 8x8 or 16x16 subpartition, respectively.
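
For readers following along, the slice skipping described above can be
sketched in C roughly as follows. This is only an illustration, not the
NEON code from the patch: the threshold values, the function names, the
contiguous slice layout and the copy-only "transform" are all assumptions
made for the example. Only the idea (compare eob against a per-slice
limit, skip the first-pass work for slices that cannot contain nonzero
coefficients, and zero their intermediate output) comes from the
description.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N     16   /* transform size (16x16 case) */
#define SLICE 4    /* lines handled per first-pass iteration */

/* Hypothetical thresholds: slice i can only contain nonzero coefficients
 * when eob is larger than min_eob[i]. Illustrative values, not taken from
 * the patch. */
static const int min_eob[N / SLICE] = { 0, 10, 38, 89 };

/* Number of leading slices the first pass actually has to transform. */
static int slices_needed(int eob)
{
    int n = 0;
    while (n < N / SLICE && eob > min_eob[n])
        n++;
    return n;
}

/* Stand-in for the real 1-D transform of one slice; it only copies the
 * data so that the control flow can be exercised. */
static void transform_slice(int16_t *dst, const int16_t *src)
{
    memcpy(dst, src, sizeof(*dst) * SLICE * N);
}

/* First pass: transform only the slices that can hold data and zero the
 * rest of the intermediate buffer, so the second pass can run unchanged.
 * Returns the number of slices that were transformed. */
static int first_pass(int16_t *tmp, const int16_t *coeffs, int eob)
{
    int n = slices_needed(eob);
    int i;

    assert(tmp && coeffs);

    for (i = 0; i < n; i++)
        transform_slice(tmp + i * SLICE * N, coeffs + i * SLICE * N);

    if (n < N / SLICE)
        memset(tmp + n * SLICE * N, 0,
               sizeof(*tmp) * (N / SLICE - n) * SLICE * N);
    return n;
}
```

The one table lookup and compare per slice would correspond to the small
overhead seen in the full sub16/sub32 case, while low eob values skip
most of the first-pass work.
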
> ---
> Updated with Janne's suggestions. The weird speedup for
> vp9_inv_dct_dct_16x16_sub16_add_neon on the Cortex A8 in the previous
> iteration of the patch seems to be mostly within noise for that test; it
> does still appear occasionally when testing.
> ---
>  libavcodec/arm/vp9itxfm_neon.S | 75 +++++++++++++++++++++++++++++++++++++-----
>  tests/checkasm/vp9dsp.c        |  6 ++--
>  2 files changed, 70 insertions(+), 11 deletions(-)

ok

Janne
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
