Hi,

(again, thanks to both of you for documenting all this assembly/NEON code)

On 09/04/2016 10:22, Matthieu Bouron wrote:
From: Matthieu Bouron <matthieu.bou...@stupeflix.com>

---
Hello,

The following patch add yuv2planeX_8_neon function for the arm platform.

It is currently restricted to 8-bit per component sources until I fix fate
issues with 10-bit sources (the dnxhd-*-10bit tests fail but I haven't
figured out yet where it comes from).

Matthieu
---
 libswscale/arm/Makefile  |  1 +
 libswscale/arm/output.S  | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
 libswscale/arm/swscale.c |  7 +++++
 libswscale/utils.c       |  3 +-
 4 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 libswscale/arm/output.S

[...]

diff --git a/libswscale/arm/output.S b/libswscale/arm/output.S
new file mode 100644
index 0000000..4437447
--- /dev/null
+++ b/libswscale/arm/output.S
@@ -0,0 +1,78 @@
[...]
+function ff_yuv2planeX_8_neon, export=1
+    push        {r4-r12, lr}
+    vpush       {q4-q7}
+    ldr         r4, [sp, #104]          @ dstW
+    ldr         r5, [sp, #108]          @ dither
+    ldr         r6, [sp, #112]          @ offset
+    vld1.8      {d0}, [r5]              @ load 8x8-bit dither values
+    tst         r6, #0                  @ check offsetting which can be 0 or 3 only
+    beq         1f
+    vext.u8     d0, d0, d0, #3          @ honor offseting which can be 3 only
+1:  vmovl.u8    q0, d0                  @ extend dither to 16-bit
+    vshll.u16   q1, d0, #12             @ extend dither to 32-bit with left shift by 12 (part 1)
+    vshll.u16   q2, d1, #12             @ extend dither to 32-bit with left shift by 12 (part 2)
+    mov         r7, #0                  @ i = 0
+2:  vmov.u8     q3, q1                  @ initialize accumulator with dithering values (part 1)
+    vmov.u8     q4, q2                  @ initialize accumulator with dithering values (part 2)
+    mov         r8, r1                  @ tmpFilterSize = filterSize
+    mov         r9, r2                  @ srcp
+    mov         r10, r0                 @ filterp
+3:  ldr         r11, [r9], #4           @ get pointer @ src[j]
+    ldr         r12, [r9], #4           @ get pointer @ src[j+1]
+    add         r11, r11, r7, lsl #1    @ &src[j][i]
+    add         r12, r12, r7, lsl #1    @ &src[j+1][i]
+    vld1.16     {q5}, [r11]             @ read 8x16-bit @ src[j  ][i + {0..7}]: A,B,C,D,E,F,G,H
+    vld1.16     {q6}, [r12]             @ read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
+    ldr         r11, [r10], #4          @ read 2x16-bit coeffs (X, Y) at (filter[j], filter[j+1])
+    vmov.16     q7, q5                  @ copy 8x16-bit @ src[j  ][i + {0..7}] for following inplace zip instruction
+    vmov.16     q8, q6                  @ copy 8x16-bit @ src[j+1][i + {0..7}] for following inplace zip instruction
+    vzip.16     q7, q8                  @ A,I,B,J,C,K,D,L,E,M,F,N,G,O,H,L
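
For anyone following the lane order in that last comment: as far as I
understand it, vzip.16 on two q registers interleaves them element by
element, with the low halves ending up in the first register and the high
halves in the second. Below is a rough C model of that, only as a sketch:
plain arrays stand in for q7/q8, and zip16() is a made-up helper name that
is not part of the patch or of libswscale.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): models the lane order left
 * behind by `vzip.16 qA, qB`, with plain arrays standing in for the two
 * 8x16-bit registers. Both arrays are updated in place, as the
 * instruction does with its operands.
 */
static void zip16(uint16_t a[8], uint16_t b[8])
{
    uint16_t lo[8], hi[8];
    int i;

    for (i = 0; i < 4; i++) {
        lo[2 * i]     = a[i];       /* low halves interleave into qA  */
        lo[2 * i + 1] = b[i];
        hi[2 * i]     = a[i + 4];   /* high halves interleave into qB */
        hi[2 * i + 1] = b[i + 4];
    }
    for (i = 0; i < 8; i++) {
        a[i] = lo[i];
        b[i] = hi[i];
    }
}
```

Calling zip16() on A..H and I..P leaves A,I,B,J,C,K,D,L in the first array
and E,M,F,N,G,O,H,P in the second.
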

nit: the vzip.16 comment should end with G,O,H,P, not G,O,H,L

-- 
Ben
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel