On 2016-11-13 00:32:17 +0200, Martin Storsjö wrote:
> This work is sponsored by, and copyright, Google.
>
> These are ported from the ARM version; thanks to the larger
> amount of registers available, we can do the 16x16 and 32x32
> transforms in slices 8 pixels wide instead of 4. This gives
> a speedup of around 1.4x compared to the 32 bit version.
>
> The fact that aarch64 doesn't have the same d/q register
> aliasing makes some of the macros quite a bit simpler as well.
>
> Examples of runtimes vs the 32 bit version, on a Cortex A53:
>                                         ARM   AArch64
> vp9_inv_adst_adst_4x4_add_neon:        90.0      87.7
> vp9_inv_adst_adst_8x8_add_neon:       400.0     354.8
> vp9_inv_adst_adst_16x16_add_neon:    2526.5    1833.4
> vp9_inv_dct_dct_4x4_add_neon:          74.0      72.7
> vp9_inv_dct_dct_8x8_add_neon:         271.0     256.8
> vp9_inv_dct_dct_16x16_add_neon:      1960.7    1375.5
> vp9_inv_dct_dct_32x32_add_neon:     11988.9    8117.1
> vp9_inv_wht_wht_4x4_add_neon:          63.0      57.7
>
> The speedup vs C code (2-4x) is smaller than in the 32 bit case,
> mostly because the C code ends up significantly faster (around
> 1.6x faster, with GCC 5.4) when built for aarch64.
>
> Examples of runtimes vs C on a Cortex A57 and an Apple A7:
>                                     A57 gcc-5.3     neon  A7 xcode-7.2     neon
> vp9_inv_adst_adst_4x4_add_neon:           152.2     60.0         288.7     64.8
> vp9_inv_adst_adst_8x8_add_neon:           948.2    288.0        1653.2    427.4
> vp9_inv_adst_adst_16x16_add_neon:        4830.4   1380.5        8840.9   1585.5
> vp9_inv_dct_dct_4x4_add_neon:             153.0     58.6         273.9     57.6
> vp9_inv_dct_dct_8x8_add_neon:             789.2    180.2        1293.2    179.8
> vp9_inv_dct_dct_16x16_add_neon:          3639.6    917.1        6996.3    760.6
> vp9_inv_dct_dct_32x32_add_neon:         20462.1   4985.0       35520.2   5705.2
> vp9_inv_wht_wht_4x4_add_neon:              91.0     49.8         179.9     66.0
>
> The asm is around factor 3-4 faster than C on the cortex-a57 and the asm
> is around 30-50% faster on the a57 compared to the a53.
>
> ---
> v3: Applied Janne's review comments from the aarch64 version.
> Around 250 cycles speedup for idct32, 18 cycles for dct_dct_16,
> 35 cycles for adst_adst_16. Will push tomorrow based on Janne's ok
> on the previous version.
>
> Included Janne's benchmarks in the commit message, even though they're
> from the previous version.
>
> v2: Updated based on the review for the arm version. Added newlines
> between macros, removed the _neg macro, rescheduled idct4, loading
> iadst8_coeffs+idct_coeffs by incrementing the pointer, instead of
> using two movrels, using two registers for load+add+store where it
> does help, removed unused labels.
>
> Using cbz instead of cmp+beq (since cbz is thumb-only in 32 bit),
> added a missing comma in a macro invocation.
> ---
>  libavcodec/aarch64/Makefile              |    3 +-
>  libavcodec/aarch64/vp9dsp_init_aarch64.c |   51 +-
>  libavcodec/aarch64/vp9itxfm_neon.S       | 1120 ++++++++++++++++++++++++++++++
>  3 files changed, 1172 insertions(+), 2 deletions(-)
>  create mode 100644 libavcodec/aarch64/vp9itxfm_neon.S

patch ok

Janne
