On Sat, 4 Feb 2017, Janne Grunau wrote:

On 2016-12-01 11:26:58 +0200, Martin Storsjö wrote:
+        @ Store the transposed 4x4 blocks horizontally.
+        @ The first 4x4 block is kept in registers for the second pass,
+        @ store the rest in the temp buffer.
+        add             r0,  r0,  #8
+.irp i, 20, 24, 28
+        vst1.16         {d\i}, [r0,:64]!
+.endr

just writing the .irp out would use the same number of lines and is imho clearer, the same apply below

Hmm, indeed. This also exists in the current code, so I sent a separate patch fixing that up.

+        beq             1f
+        mov             r12, #32
+        @ Only load the top 4 lines, and only do it for the later slices.
+        @ For the first slice, d16-d19 is kept in registers from the first 
pass.

the comment is oddly placed, please move it above the cmp #

Indeed, fixed.

+function idct16x16_quarter_add_neon
+        push            {r4-r9,lr}
+
+        @ Align the stack, allocate a temp buffer
+T       mov             r7,  sp
+T       and             r7,  r7,  #15
+A       and             r7,  sp,  #15
+        add             r7,  r7,  #512
+        sub             sp,  sp,  r7
+
+        mov             r4,  r0
+        mov             r5,  r1
+        mov             r6,  r2
+
+        movrel          r12, idct_coeffs
+        vld1.16         {q0-q1}, [r12,:128]
+
+.irp i, 0

not needed, unless idct16x16_half_add and idct16x16_quarter_add are templated using a macro. The first .irp is still annoying though.
I would probably use a '.ifc \size, quarter ...; .else ...; .endif'
Has the full idct the same format?

Yes, pretty much, but the full idct also can have iadst and has got a bunch of other details (e.g. skipping steps in the first pass when checking eob). Templating in this over there would IMO make it too messy, but it's pretty easy to do both quarter and half with one template.

I'm not really sure which variant I prefer. Is the speed difference mesuable for idct heavy real world samples? If you have preference for one or the other variant I trust your judgement.

It's measurable, but it's not much. For one sample, I originally got a full decode time like this (fastest time out of 2 runs) with the current master:
user    2m53.980s
Alternative 1:
user    2m53.448s
Alternative 2:
user    2m52.952s

So alternative 2 is better, but produces a couple KB bigger binaries, and more duplicated code. (OTOH also allowing more exact special casing of minor details.)

I originally clearly preferred alt 2, but with your suggestions for alt 1, the diff for that one ends up very small and neat.

I'll think some more about it and see which one I prefer. Perhaps alt 1 still is the best tradeoff between code duplication, diff size and speed.

// Martin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel

Reply via email to