On 2017-02-09 13:39:55 +0200, Martin Storsjö wrote:
> The idct32x32 function actually backed up and restored q4-q7 even
> though it didn't clobber them; there are plenty of registers that
> can be used to allow keeping all the idct coefficients in registers
> without having to reload different subsets of them at different
> stages in the transform.
>
> Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
> q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
> in the idct16 function), and the lanewise vmul needs a register in
> the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
> while doing idct16.
>
> While keeping these coefficients in registers, we still can skip backing
> up and restoring q7.
>
> Before:                              Cortex A7       A8       A9      A53
> vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
> After:
> vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
> ---
>  libavcodec/arm/vp9itxfm_neon.S | 246 ++++++++++++++++++++---------------------
>  1 file changed, 120 insertions(+), 126 deletions(-)

ok

Janne