On Wed, 2 Nov 2016, Martin Storsjö wrote:
>> +@ Instantiate a horizontal filter function for the given size.
>> +@ This can work on 4, 8 or 16 pixels in parallel; for larger
>> +@ widths it will do 16 pixels at a time and loop horizontally.
>> +@ The actual width is passed in r5, the height in r4 and
>> +@ the filter coefficients in r12. idx2 is the index of the largest
>> +@ filter coefficient (3 or 4) and idx1 is the other one of them.
>> +.macro do_8tap_h type, size, idx1, idx2
>> +function \type\()_8tap_\size\()h_\idx1\idx2
>> +        sub r2, r2, #3
>> +        add r6, r0, r1
>> +        add r7, r2, r3
>> +        add r1, r1, r1
>> +        add r3, r3, r3
>> +        @ Only size >= 16 loops horizontally and needs
>> +        @ reduced dst stride
>> +.if \size >= 16
>> +        sub r1, r1, r5
>> +.endif
>> +        @ size >= 16 loads two qwords and increments r2,
>> +        @ for size 4/8 it's enough with one qword and no
>> +        @ postincrement
>> +.if \size >= 16
>> +        sub r3, r3, r5
>> +        sub r3, r3, #8
>> +.endif
>> +        @ Load the filter vector
>> +        vld1.16 {q0}, [r12,:128]
>> +1:
>> +.if \size >= 16
>> +        mov r12, r5
>> +.endif
>> +        @ Load src
>> +.if \size >= 16
>> +        vld1.8 {q8}, [r2]!
>> +        vld1.8 {q11}, [r7]!
>> +        vld1.8 {d20}, [r2]!
>> +        vld1.8 {d26}, [r7]!
>> +.else
>> +        vld1.8 {q8}, [r2]
>> +        vld1.8 {q11}, [r7]
>> +.endif
>> +        vmovl.u8 q9, d17
>> +        vmovl.u8 q8, d16
>> +        vmovl.u8 q12, d23
>> +        vmovl.u8 q11, d22
>> +.if \size >= 16
>> +        vmovl.u8 q10, d20
>> +        vmovl.u8 q13, d26
>> +.endif
>
> .if \size >= 16
>         vld1.8 {d18, d19, d20}, [r2]!
>         vld1.8 {d24, d25, d26}, [r7]!
> .else
>         vld1.8 {q9}, [r2]
>         vld1.8 {q12}, [r7]
> .endif
>         vmovl.u8 q8, d18
>         vmovl.u8 q9, d19
>         vmovl.u8 q11, d24
>         vmovl.u8 q12, d25
>
> should be marginally faster

Oh, nice - yes, that's a bit faster.
I applied the same logic to the aarch64 version as well; there it's no speed change at all, but it's simpler.
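
For anyone following along without the NEON details in their head, here is a rough scalar C sketch of what I understand the quoted horizontal macro to compute. Only the 3-pixel source rewind (the `sub r2, r2, #3`), the eight 16-bit coefficients and the widening of 8-bit pixels come from the quoted code. The function name `ref_8tap_h`, its argument order, and in particular the final rounding (add 64, shift right by 7, clamp to 8 bits) are assumptions made for illustration and are not shown in the quoted part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for an 8-tap horizontal filter.
 * ASSUMPTION: the result is rounded with (sum + 64) >> 7 and clamped to
 * 0..255; the quoted patch excerpt does not show this step. */
static void ref_8tap_h(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride,
                       int width, int height, const int16_t filter[8])
{
    assert(width > 0 && height > 0);

    /* Matches "sub r2, r2, #3": the 8 taps cover src[x-3] .. src[x+4]. */
    src -= 3;

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            /* Pixels are widened from 8 bit before multiplying,
             * like the vmovl.u8 instructions in the macro. */
            for (int k = 0; k < 8; k++)
                sum += filter[k] * (int)src[x + k];

            /* Clamp negatives first so we never shift a negative value. */
            int v = sum + 64;
            v = v < 0 ? 0 : v >> 7;
            dst[x] = (uint8_t)(v > 255 ? 255 : v);
        }
        dst += dst_stride;
        src += src_stride;
    }
}
```

With a filter of {0, 0, 0, 128, 0, 0, 0, 0} this should simply copy the source row, which is a handy sanity check. Note that each row reads 3 bytes before and 4 bytes after the output span, so the caller has to provide that padding.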
>> +@ Instantiate a vertical filter function for filtering a 4 pixels wide
>> +@ slice. The first half of the registers contain one row, while the second
>> +@ half of a register contains the second-next row (also stored in the first
>> +@ half of the register two steps ahead). The convolution does two outputs
>> +@ at a time; the output of q5-q12 into one, and q4-q13 into another one.
>> +@ The first half of first output is the first output row, the first half
>> +@ of the other output is the second output row. The second halves of the
>> +@ registers are rows 3 and 4.
>> +@ This only is designed to work for 4 or 8 output lines.
>> +.macro do_8tap_4v type, idx1, idx2
>> +function \type\()_8tap_4v_\idx1\idx2
>> +        sub r2, r2, r3, lsl #1
>> +        sub r2, r2, r3
>> +        vld1.16 {q0}, [r12, :128]
>> +
>> +        vld1.32 {d2[]}, [r2], r3
>> +        vld1.32 {d3[]}, [r2], r3
>> +        vld1.32 {d4[]}, [r2], r3
>> +        vld1.32 {d5[]}, [r2], r3
>> +        vld1.32 {d6[]}, [r2], r3
>> +        vld1.32 {d7[]}, [r2], r3
>> +        vext.8 d2, d2, d4, #4
>> +        vld1.32 {d8[]}, [r2], r3
>> +        vext.8 d3, d3, d5, #4
>> +        vld1.32 {d9[]}, [r2], r3
>> +        vmovl.u8 q5, d2
>> +        vext.8 d4, d4, d6, #4
>> +        vld1.32 {d28[]}, [r2], r3
>> +        vmovl.u8 q6, d3
>> +        vext.8 d5, d5, d7, #4
>> +        vmovl.u8 q7, d4
>> +        vext.8 d6, d6, d8, #4
>> +        vld1.32 {d9[1]}, [r2], r3
>
> it probably makes sense to continue the vld1.32 {d[]}, vext.8 pattern.
> d30 and d31 should be free. It shouldn't be much slower for the
> height == 4 case and help for height == 8.

Ah, yes. Around 1 cycle slower for height == 4, and around 9 cycles faster for height == 8.
Applied the same change to the aarch64 version as well.

I'll push the arm version tomorrow unless there are more comments on it.

// Martin
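
As a footnote on the `vld1.32 {d[]}` plus `vext.8` pattern discussed for the 4 pixel wide vertical filter: the small C model below is only a sketch of how I read it, not code from the patch. Each 4-byte row is loaded duplicated into both halves of a 64-bit register, and `vext.8 dA, dA, dB, #4` then leaves row n in the first half and row n+2 in the second half, which is the layout the macro comment describes. The helper names `load_dup32`, `vext8` and `pack_rows` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of "vld1.32 {dN[]}, [src]": load 4 bytes and duplicate them
 * into both halves of an 8-byte D register. */
static void load_dup32(uint8_t d[8], const uint8_t *src)
{
    memcpy(d, src, 4);
    memcpy(d + 4, src, 4);
}

/* Model of "vext.8 dd, dn, dm, #imm": take 8 consecutive bytes starting
 * at byte imm of the 16-byte concatenation dn:dm. */
static void vext8(uint8_t dd[8], const uint8_t dn[8], const uint8_t dm[8],
                  int imm)
{
    uint8_t tmp[16];
    assert(imm >= 0 && imm < 8);
    memcpy(tmp, dn, 8);
    memcpy(tmp + 8, dm, 8);
    memcpy(dd, tmp + imm, 8);
}

/* Hypothetical helper: for nrows input rows of 4 pixels, produce
 * nrows - 2 packed registers where packed[i] holds row i in its first
 * half and row i + 2 in its second half. */
static void pack_rows(uint8_t (*packed)[8], const uint8_t *src,
                      ptrdiff_t stride, int nrows)
{
    uint8_t prev2[8], prev1[8], cur[8];
    assert(nrows >= 3);

    load_dup32(prev2, src);
    load_dup32(prev1, src + stride);
    for (int i = 2; i < nrows; i++) {
        load_dup32(cur, src + i * stride);
        /* Second half of row i-2's register, then first half of row i's. */
        vext8(packed[i - 2], prev2, cur, 4);
        memcpy(prev2, prev1, 8);
        memcpy(prev1, cur, 8);
    }
}
```

Since every load is the same duplicating load, continuing the pattern needs no single-lane loads such as the `vld1.32 {d9[1]}` in the quoted code; whether that is what makes height == 8 faster is my guess, not something measured here.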
