On Sat, 12 Nov 2016, Janne Grunau wrote:
On 2016-11-11 21:43:08 +0200, Martin Storsjö wrote:
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the 16x16 and 32x32
transforms in slices 8 pixels wide instead of 4. This gives
a speedup of around 1.4x compared to the 32 bit version.
The fact that aarch64 doesn't have the same d/q register
aliasing makes some of the macros quite a bit simpler as well.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                        ARM      AArch64
vp9_inv_adst_adst_4x4_add_neon:        90.0         87.7
vp9_inv_adst_adst_8x8_add_neon:       400.0        354.7
vp9_inv_adst_adst_16x16_add_neon:    2526.5       1868.3
vp9_inv_dct_dct_4x4_add_neon:          74.0         72.7
vp9_inv_dct_dct_8x8_add_neon:         271.0        256.7
vp9_inv_dct_dct_16x16_add_neon:      1960.7       1393.0
vp9_inv_dct_dct_32x32_add_neon:     11988.9       8375.0
vp9_inv_wht_wht_4x4_add_neon:          63.0         57.7
The speedup vs C code (2-4x) is smaller than in the 32 bit case,
mostly because the C code ends up significantly faster (around
1.6x faster, with GCC 5.4) when built for aarch64.
Benchmarks on Cortex-A57 and Apple A7 are below. There is no access to the
cycle counter on iOS, and the high performance timer has a resolution of
125/3 ns. The timer is scaled to the number of cycles per timer tick. To
counter the coarseness, the number of repetitions in checkasm is increased to
108. So please take the A7 numbers with a large grain of salt.
Especially since the A7 is faster than the Cortex-A57 in some tests (as
expected) but slower in others.
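(For context: assuming the A7 runs at roughly 1.3-1.4 GHz, which isn't stated
here, one 125/3 ns ≈ 41.7 ns tick corresponds to only about 54-58 cycles, so
the smaller functions below span just a handful of ticks per invocation.)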
                                   A57 C (gcc-5.3)   A57 neon   A7 C (xcode-7.2)    A7 neon
vp9_inv_adst_adst_4x4_add_neon:              152.2       60.0              288.7       64.8
vp9_inv_adst_adst_8x8_add_neon:              948.2      288.0             1653.2      427.4
vp9_inv_adst_adst_16x16_add_neon:           4830.4     1380.5             8840.9     1585.5
vp9_inv_dct_dct_4x4_add_neon:                153.0       58.6              273.9       57.6
vp9_inv_dct_dct_8x8_add_neon:                789.2      180.2             1293.2      179.8
vp9_inv_dct_dct_16x16_add_neon:             3639.6      917.1             6996.3      760.6
vp9_inv_dct_dct_32x32_add_neon:            20462.1     4985.0            35520.2     5705.2
vp9_inv_wht_wht_4x4_add_neon:                 91.0       49.8              179.9       66.0
The asm is around a factor of 3-4 faster than C on the Cortex-A57, and
around 30-50% faster on the A57 compared to the A53.
Thanks! I included this in the updated commit message.
+// out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
+// out2 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14
+// in/out are .8h registers; this can do with 4 temp registers, but is
+// more efficient if 6 temp registers are available.
+.macro dmbutterfly0 out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, neg=0
+ add \tmp1\().8h, \in1\().8h, \in2\().8h
+ sub \tmp2\().8h, \in1\().8h, \in2\().8h
+ smull \tmp3\().4s, \tmp1\().4h, v0.h[0]
+ smull2 \tmp4\().4s, \tmp1\().8h, v0.h[0]
+.ifb \tmp5
+.if \neg > 0
+ neg \tmp3\().4s, \tmp3\().4s
+ neg \tmp4\().4s, \tmp4\().4s
Negating \tmp1 before the multiplication is unfortunately slower, but
negating v0.4h into tmp6 and using that for the multiplication is ~10
cycles faster for vp9_inv_adst_adst_16x16_add and
vp9_inv_dct_adst_16x16_add, and strangely 1 cycle slower for
vp9_inv_adst_dct_16x16_add on the Cortex-A57. iadst16 is the only place
where the negated variant is used. It is used twice, but unfortunately
there's no free register in v0-v15, so we have to use a tmp register.
Hmm, indeed. (I used tmp4 instead of tmp6 though, to make it work in case
it'd be called without tmp5-tmp6, even though we currently don't.)
On the dragonboard, this got 10 cycles faster in
vp9_inv_adst_adst_16x16_add and 1 cycle faster for
vp9_inv_dct_adst_16x16_add and vp9_inv_adst_dct_16x16_add.
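For illustration, a rough sketch of how the multiplication part of the macro
could look with that change (reusing \tmp4 as described above; this assumes
\tmp4 is allocated in v0-v15, since the by-element form of smull/smull2 with
.h elements only accepts v0-v15 as the multiplier register):

.if \neg > 0
        neg     \tmp4\().4h, v0.4h                       // negate the coefficient once
        smull   \tmp3\().4s, \tmp1\().4h, \tmp4\().h[0]
        smull2  \tmp4\().4s, \tmp1\().8h, \tmp4\().h[0]  // \tmp4 is read before being overwritten
.else
        smull   \tmp3\().4s, \tmp1\().4h, v0.h[0]
        smull2  \tmp4\().4s, \tmp1\().8h, v0.h[0]
.endif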
+.ifc \txfm1,idct
+.ifc \txfm2,idct
.ifc \txfm1\()_\txfm2, idct_idct
looks a little nicer and avoids a .ifc/.endif pair, please apply
everywhere
Oh, great! That does indeed make things a lot nicer.
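To spell the suggestion out (with \() terminating the argument names so the
concatenation assembles):

// instead of two nested checks:
.ifc \txfm1,idct
.ifc \txfm2,idct
        cmp     x3,  #1
        b.eq    idct16x16_dc_add_neon
.endif
.endif
// a single check on the concatenated names does the same:
.ifc \txfm1\()_\txfm2,idct_idct
        cmp     x3,  #1
        b.eq    idct16x16_dc_add_neon
.endif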
+.macro itxfm_func16x16 txfm1, txfm2
+function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
+.ifc \txfm1,idct
+.ifc \txfm2,idct
+ cmp x3, #1
+ b.eq idct16x16_dc_add_neon
+.endif
+.endif
+ stp x29, x30, [sp, #-16]
This violates the universal stack constraints in AAPCS64. A process may
only access the interval [SP, stack_base - 1]. Technically '.. [sp,
#-16]!' would violate it too, since according to the pseudo code in the
reference manual the writeback happens after the load/store, but gcc
generates such code.
Thanks for noticing! I avoided the issue entirely by just moving x30 into
x15 instead, and returning with "br x15" instead of ret.
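Roughly, that entry/exit scheme looks like this (a sketch only; the helper
name is a stand-in):

// the flagged form stores x29/x30 below sp without moving sp, i.e.
// outside [sp, stack_base - 1]:
//      stp     x29, x30, [sp, #-16]
// stashing the return address in a scratch gpr avoids the stack entirely:
        mov     x15, x30                 // x30 is clobbered by the bl below
        ...
        bl      some_transform_helper   // stand-in for the helpers called with bl
        ...
        br      x15                      // return via the stashed address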
+ mov x29, sp
+ sub sp, sp, #0x50
+ // iadst16 requires clobbering v8-v15, but idct16 doesn't need to.
+.ifc \txfm1,iadst
.ifnc \txfm1\()_\txfm2, idct_idct
avoids the duplication, or the macro could get another parameter if it
needs to preserve d8-d15
Yep, I got rid of a lot of extra duplication thanks to this, not only
here.
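A sketch of how the conditional save/restore could look (the stack layout
here is only an assumption, not necessarily what the updated patch does):

.ifnc \txfm1\()_\txfm2,idct_idct
        // only the iadst paths clobber v8-v15; AAPCS64 requires the low
        // 64 bits (d8-d15) of those registers to be preserved
        stp     d8,  d9,  [sp, #-0x40]!
        stp     d10, d11, [sp, #0x10]
        stp     d12, d13, [sp, #0x20]
        stp     d14, d15, [sp, #0x30]
.endif
        ...
.ifnc \txfm1\()_\txfm2,idct_idct
        ldp     d10, d11, [sp, #0x10]
        ldp     d12, d13, [sp, #0x20]
        ldp     d14, d15, [sp, #0x30]
        ldp     d8,  d9,  [sp], #0x40
.endif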
+ mov sp, x29
+ ldp x29, x30, [sp, #-16]
stack access
Fixed, and all the other ones as well
+.macro idct32_odd
+ movrel x9, idct_coeffs
+ add x9, x9, #32
Add into a free gp reg so you don't have to subtract later, or add a label
for this and subtract 32 after loading from it. Or even better: there
should be enough gprs to keep the address of idct_coeffs in the same
register all the time, and idct_coeffs+32 for the 32x32 in another one.
Done; using a separate gpr (aarch64 sure is nice with so many extra gprs)
for both, getting rid of a bunch of extra movrel's. Also used the same
trick for iadst/idct16, where I could also avoid a duplicate load of
idct_coeffs in idct_idct, by thinking a bit more about what I'm doing.
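The idea, roughly (the concrete register numbers are just assumptions):

        movrel  x10, idct_coeffs
        add     x11, x10, #32            // idct_coeffs + 32, for the odd half
        ...
        ld1     {v0.8h,v1.8h}, [x10]     // even-half coefficients
        ...
        ld1     {v0.8h,v1.8h}, [x11]     // odd-half coefficients, no extra movrel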
+.macro store_rev a, b
+.irp i, \a, \b
+ store \i, x0, #16
+ rev128_i \i
+.endr
+.irp i, \b, \a
+ store \i, x0, #16
I had limited speed-ups from writing this out and mixing the stores with
the rev128 by reversing into one of the free registers. I should probably
have tested on the Cortex-A53 instead of the A57. Writing this out is imho
clearer and 1 line shorter.
Indeed, I got a decent speedup thanks to this
+.endr
+.endm
+ store_rev 16, 24
+ store_rev 17, 25
+ store_rev 18, 26
+ store_rev 19, 27
+ store_rev 20, 28
+ store_rev 21, 29
+ store_rev 22, 30
+ store_rev 23, 31
+ sub x0, x0, #512
+.purgem store_rev
+
+ // Move x2 back to the start of the input, and move
+ // to the first odd row
+ sub x2, x2, x9, lsl #4
+ add x2, x2, #64
+
+ movi v4.8h, #0
+ // v16 = IN(1), v17 = IN(3) ... v31 = IN(31)
+.irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
+ ld1 {v\i\().8h}, [x2]
+ st1 {v4.8h}, [x2], x9
+.endr
+
+ idct32_odd
+
I saw some speed-ups with prefetching x0 from here
I'm unable to measure any speedup on the A53 with prefetching here, so I'm
holding off doing that. That's after improving the store_rev below though.
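For reference, the prefetch under discussion would be something along these
lines (the offsets are only an assumption):

        prfm    pldl1keep, [x0]          // prime the destination rows for the
        prfm    pldl1keep, [x0, #64]     // add/store pass further down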
+ transpose_8x8H v31, v30, v29, v28, v27, v26, v25, v24, v2, v3
+ transpose_8x8H v23, v22, v21, v20, v19, v18, v17, v16, v2, v3
+
+ // Helper macros to work around concatenation with the irp loop variable
+.macro add_8h a, b, c
+ add v\a\().8h, v\b\().8h, v\c\().8h
+.endm
+.macro sub_8h a, b, c
+ sub v\a\().8h, v\b\().8h, v\c\().8h
+.endm
+ // Store the registers a, b horizontally,
+ // first adding them into the output, and then the mirrored (reversed)
+ // registers, subtracted from the output.
+.macro store_rev a, b
+.irp i, \a, \b
+ ld1 {v4.8h}, [x0]
+ add_8h 4, 4, \i
+ st1 {v4.8h}, [x0], #16
+ rev128_i \i
+.endr
+.irp i, \b, \a
+ ld1 {v4.8h}, [x0]
+ sub_8h 4, 4, \i
+ st1 {v4.8h}, [x0], #16
+.endr
I thought I could get this faster by using a free neon register and
combining loads/stores, but it wasn't much faster. I probably shouldn't
have used the out-of-order Cortex-A57 for testing since it's
frustrating. No need to do anything here. I'll play with it further
post-commit.
I got some sort of speedup here by using a free neon register and
reordering things for the in-order cores.
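For illustration, one way that kind of reordering can look when store_rev is
written out (just a sketch: v4/v5 are assumed free, and the reversal of \a is
spelled out as rev64 + ext so it can go into a spare register; this is not
necessarily what the updated patch does):

.macro store_rev a, b
        rev64   v5.8h,  v\a\().8h               // start reversing \a into a spare early
        ld1     {v4.8h}, [x0]
        ext     v5.16b, v5.16b, v5.16b, #8      // v5 = \a with all 8 .h lanes reversed
        add     v4.8h,  v4.8h,  v\a\().8h
        st1     {v4.8h}, [x0], #16
        ld1     {v4.8h}, [x0]
        add     v4.8h,  v4.8h,  v\b\().8h
        rev128_i \b                             // \b only needs reversing after its add
        st1     {v4.8h}, [x0], #16
        ld1     {v4.8h}, [x0]
        sub     v4.8h,  v4.8h,  v\b\().8h       // now the reversed \b
        st1     {v4.8h}, [x0], #16
        ld1     {v4.8h}, [x0]
        sub     v4.8h,  v4.8h,  v5.8h           // and the reversed \a
        st1     {v4.8h}, [x0], #16
        // note: unlike the original macro, \a is left unreversed in place here
.endm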
+function ff_vp9_idct_idct_32x32_add_neon, export=1
+ cmp x3, #1
+ b.eq idct32x32_dc_add_neon
+
+ stp x29, x30, [sp, #-16]
stack access
Fixed
+ mov sp, x29
+ ldp x29, x30, [sp, #-16]
stack access
Fixed
otherwise ok
Thanks - I'll post the updated patch which I'll push tomorrow, and a patch
for the arm version to simplify the .ifc soup there as well.
// Martin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel