Re: [libav-devel] [PATCHv4] arm: vp9: Add NEON optimizations of VP9 MC functions

Martin Storsjö Wed, 02 Nov 2016 14:15:09 -0700

On Wed, 2 Nov 2016, Martin Storsjö wrote:

+@ Instantiate a horizontal filter function for the given size.
+@ This can work on 4, 8 or 16 pixels in parallel; for larger
+@ widths it will do 16 pixels at a time and loop horizontally.
+@ The actual width is passed in r5, the height in r4 and
+@ the filter coefficients in r12. idx2 is the index of the largest
+@ filter coefficient (3 or 4) and idx1 is the other one of them.
+.macro do_8tap_h type, size, idx1, idx2
+function \type\()_8tap_\size\()h_\idx1\idx2
+        sub             r2,  r2,  #3
+        add             r6,  r0,  r1
+        add             r7,  r2,  r3
+        add             r1,  r1,  r1
+        add             r3,  r3,  r3
+        @ Only size >= 16 loops horizontally and needs
+        @ reduced dst stride
+.if \size >= 16
+        sub             r1,  r1,  r5
+.endif
+        @ size >= 16 loads two qwords and increments r2,
+        @ for size 4/8 it's enough with one qword and no
+        @ postincrement
+.if \size >= 16
+        sub             r3,  r3,  r5
+        sub             r3,  r3,  #8
+.endif
+        @ Load the filter vector
+        vld1.16         {q0},  [r12,:128]
+1:
+.if \size >= 16
+        mov             r12, r5
+.endif
+        @ Load src
+.if \size >= 16
+        vld1.8          {q8},  [r2]!
+        vld1.8          {q11}, [r7]!
+        vld1.8          {d20}, [r2]!
+        vld1.8          {d26}, [r7]!
+.else
+        vld1.8          {q8},  [r2]
+        vld1.8          {q11}, [r7]
+.endif
+        vmovl.u8        q9,  d17
+        vmovl.u8        q8,  d16
+        vmovl.u8        q12, d23
+        vmovl.u8        q11, d22
+.if \size >= 16
+        vmovl.u8        q10, d20
+        vmovl.u8        q13, d26
+.endif


.if \size >= 16
 vld1.8   {d18, d19, d20}, [r2]!
 vld1.8   {d24, d25, d26}, [r7]!
.else
 vld1.8   {q9},  [r2]
 vld1.8   {q12}, [r7]
.endif
 vmovl.u8 q8,  d18
 vmovl.u8 q9,  d19
 vmovl.u8 q11, d24
 vmovl.u8 q12, d25

should be marginally faster


Oh, nice - yes, that's a bit faster

I applied the same logic to the aarch64 version as well; there it's no speed change at all, but it's simpler.
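
As a rough illustration of why the two load sequences are interchangeable, the following C sketch models the NEON register aliasing (each qN overlapping d(2N) and d(2N+1)) and runs both orderings for one source row in the size >= 16 case. The RegFile type and all helper names are hypothetical, made up for this sketch; it also assumes the existing vmovl.u8 q10, d20 stays in place after the suggested change, which the snippet above does not show.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the NEON register file: 256 bytes, where
 * dN lives at byte offset 8*N and qN at byte offset 16*N, so that
 * qN aliases d(2N) and d(2N+1). */
typedef struct { uint8_t b[256]; } RegFile;

/* Models vld1.8 {dA, ...}, [src]!: fills d_count consecutive D
 * registers and returns the post-incremented source pointer. */
static const uint8_t *vld1_8(RegFile *rf, int d_first, int d_count,
                             const uint8_t *src)
{
    assert(d_first >= 0 && d_count > 0 && d_first + d_count <= 32);
    memcpy(rf->b + 8 * d_first, src, 8 * (size_t)d_count);
    return src + 8 * d_count;
}

/* Models vmovl.u8 qD, dS: zero-extends 8 bytes to 8 16-bit lanes.
 * The source is read completely before the destination is written. */
static void vmovl_u8(RegFile *rf, int q_dst, int d_src)
{
    uint8_t in[8];
    uint16_t out[8];
    memcpy(in, rf->b + 8 * d_src, 8);
    for (int i = 0; i < 8; i++)
        out[i] = in[i];
    memcpy(rf->b + 16 * q_dst, out, 16);
}

/* Reads 16-bit lane i of qN. */
static uint16_t lane16(const RegFile *rf, int q, int i)
{
    uint16_t v;
    memcpy(&v, rf->b + 16 * q + 2 * i, 2);
    return v;
}

/* Sequence from the patch: load q8 (d16, d17), then d20. q9 has to be
 * widened from d17 before q8 is written, since q8 overlaps d17. */
static void widen_original(RegFile *rf, const uint8_t *src)
{
    src = vld1_8(rf, 16, 2, src);   /* vld1.8 {q8},  [r2]! */
    src = vld1_8(rf, 20, 1, src);   /* vld1.8 {d20}, [r2]! */
    vmovl_u8(rf, 9, 17);
    vmovl_u8(rf, 8, 16);
    vmovl_u8(rf, 10, 20);
}

/* Suggested sequence: one load into d18-d20, then widen in order. */
static void widen_suggested(RegFile *rf, const uint8_t *src)
{
    src = vld1_8(rf, 18, 3, src);   /* vld1.8 {d18, d19, d20}, [r2]! */
    vmovl_u8(rf, 8, 18);
    vmovl_u8(rf, 9, 19);
    vmovl_u8(rf, 10, 20);           /* assumed to be kept from the patch */
}
```

Feeding both functions the same 24 source bytes should leave identical values in q8, q9 and q10. The model says nothing about timing; the speedup mentioned above is only what was reported in this thread.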

+@ Instantiate a vertical filter function for filtering a 4 pixels wide
+@ slice. The first half of the registers contain one row, while the second
+@ half of a register contains the second-next row (also stored in the first
+@ half of the register two steps ahead). The convolution does two outputs
+@ at a time; the output of q5-q12 into one, and q4-q13 into another one.
+@ The first half of first output is the first output row, the first half
+@ of the other output is the second output row. The second halves of the
+@ registers are rows 3 and 4.
+@ This only is designed to work for 4 or 8 output lines.
+.macro do_8tap_4v type, idx1, idx2
+function \type\()_8tap_4v_\idx1\idx2
+        sub             r2,  r2,  r3, lsl #1
+        sub             r2,  r2,  r3
+        vld1.16         {q0},  [r12, :128]
+
+        vld1.32         {d2[]},   [r2], r3
+        vld1.32         {d3[]},   [r2], r3
+        vld1.32         {d4[]},   [r2], r3
+        vld1.32         {d5[]},   [r2], r3
+        vld1.32         {d6[]},   [r2], r3
+        vld1.32         {d7[]},   [r2], r3
+        vext.8          d2,  d2,  d4,  #4
+        vld1.32         {d8[]},   [r2], r3
+        vext.8          d3,  d3,  d5,  #4
+        vld1.32         {d9[]},   [r2], r3
+        vmovl.u8        q5,  d2
+        vext.8          d4,  d4,  d6,  #4
+        vld1.32         {d28[]},  [r2], r3
+        vmovl.u8        q6,  d3
+        vext.8          d5,  d5,  d7,  #4
+        vmovl.u8        q7,  d4
+        vext.8          d6,  d6,  d8,  #4
+        vld1.32         {d9[1]},  [r2], r3


It probably makes sense to continue the vld1.32 {d[]}, vext.8 pattern.
d30 and d31 should be free. It shouldn't be much slower for the height
== 4 case and should help for height == 8.

Ah, yes. Around 1 cycle slower for height == 4, and around 9 cycles faster for height == 8.


Applied the same change to the aarch64 version as well.
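
For reference, a minimal C sketch of what the vld1.32 {d[]} plus vext.8 #4 pairing computes, assuming 4 pixel wide rows as in the 4v function. The function names (vld1_32_dup, vext_8, pack_rows) are hypothetical and exist only in this sketch; it models the data layout, not the actual register allocation or scheduling in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models vld1.32 {dN[]}, [src]: loads 4 bytes and replicates them
 * into both halves of an 8-byte D register. */
static void vld1_32_dup(uint8_t d[8], const uint8_t *src)
{
    memcpy(d, src, 4);
    memcpy(d + 4, src, 4);
}

/* Models vext.8 dD, dN, dM, #imm: the result is bytes imm..7 of dN
 * followed by bytes 0..imm-1 of dM. dst may alias n or m. */
static void vext_8(uint8_t dst[8], const uint8_t n[8],
                   const uint8_t m[8], int imm)
{
    uint8_t tmp[16];
    assert(imm >= 0 && imm < 8);
    memcpy(tmp, n, 8);
    memcpy(tmp + 8, m, 8);
    memcpy(dst, tmp + imm, 8);
}

/* Builds the paired-row layout described in the function comment:
 * afterwards regs[k] holds row k in its first half and row k+2 in its
 * second half, for every k < nrows - 2. The last two registers keep
 * their row duplicated in both halves. */
static void pack_rows(uint8_t regs[][8], const uint8_t *src,
                      ptrdiff_t stride, int nrows)
{
    for (int k = 0; k < nrows; k++)
        vld1_32_dup(regs[k], src + k * stride);
    for (int k = 0; k + 2 < nrows; k++)
        vext_8(regs[k], regs[k], regs[k + 2], 4);
}
```

In this model every row is loaded the same way and the pairing is done purely with vext.8, which is the pattern the review comment proposes to continue for the later rows.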

I'll push the arm version tomorrow unless there's more comments on it.

// Martin
