On 2017-02-09 13:39:55 +0200, Martin Storsjö wrote:
> The idct32x32 function actually backed up and restored q4-q7 even
> though it didn't clobber them; there are plenty of registers that
> can be used to allow keeping all the idct coefficients in registers
> without having to reload different subsets of them at different
> stages in the transform.
> 
> Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
> q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
> in the idct16 function), and the lanewise vmul needs a register in
> the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
> while doing idct16.
> 
> While keeping these coefficients in registers, we still can skip backing
> up and restoring q7.
> 
> Before:                              Cortex A7       A8       A9      A53
> vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
> After:
> vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
> ---
>  libavcodec/arm/vp9itxfm_neon.S | 246 ++++++++++++++++++++---------------------
>  1 file changed, 120 insertions(+), 126 deletions(-)

ok

Janne
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
