quarter idct16 and idct32 (alternative 2)

Martin Storsjö Sat, 04 Feb 2017 14:34:28 -0800

On Sat, 4 Feb 2017, Janne Grunau wrote:

On 2016-12-01 11:26:58 +0200, Martin Storsjö wrote:

+        @ Store the transposed 4x4 blocks horizontally.
+        @ The first 4x4 block is kept in registers for the second pass,
+        @ store the rest in the temp buffer.
+        add             r0,  r0,  #8
+.irp i, 20, 24, 28
+        vst1.16         {d\i}, [r0,:64]!
+.endr

just writing the .irp out would use the same number of lines and is imhoclearer, the same apply below

Hmm, indeed. This also exists in the current code, so I sent a separatepatch fixing that up.

+        beq             1f
+        mov             r12, #32
+        @ Only load the top 4 lines, and only do it for the later slices.
+        @ For the first slice, d16-d19 is kept in registers from the first 
pass.


the comment is oddly placed, please move it above the cmp #


Indeed, fixed.

+function idct16x16_quarter_add_neon
+        push            {r4-r9,lr}
+
+        @ Align the stack, allocate a temp buffer
+T       mov             r7,  sp
+T       and             r7,  r7,  #15
+A       and             r7,  sp,  #15
+        add             r7,  r7,  #512
+        sub             sp,  sp,  r7
+
+        mov             r4,  r0
+        mov             r5,  r1
+        mov             r6,  r2
+
+        movrel          r12, idct_coeffs
+        vld1.16         {q0-q1}, [r12,:128]
+
+.irp i, 0

not needed, unless idct16x16_half_add and idct16x16_quarter_add aretemplated using a macro. The first .irp is still annoying though.

I would probably use a '.ifc \size, quarter ...; .else ...; .endif'
Has the full idct the same format?

Yes, pretty much, but the full idct also can have iadst and has got abunch of other details (e.g. skipping steps in the first pass whenchecking eob). Templating in this over there would IMO make it too messy,but it's pretty easy to do both quarter and half with one template.

I'm not really sure which variant I prefer. Is the speed differencemesuable for idct heavy real world samples? If you have preference forone or the other variant I trust your judgement.

It's measurable, but it's not much. For one sample, I originally got afull decode time like this (fastest time out of 2 runs) with the currentmaster:

user    2m53.980s
Alternative 1:
user    2m53.448s
Alternative 2:
user    2m52.952s

So alternative 2 is better, but produces a couple KB bigger binaries, andmore duplicated code. (OTOH also allowing more exact special casing ofminor details.)

I originally clearly preferred alt 2, but with your suggestions for alt 1,the diff for that one ends up very small and neat.

I'll think some more about it and see which one I prefer. Perhaps alt 1still is the best tradeoff between code duplication, diff size and speed.


// Martin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel

Re: [libav-devel] [PATCH 2/5] arm: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 (alternative 2)

Reply via email to