On Sat, 4 Feb 2017, Janne Grunau wrote:
On 2016-12-01 11:26:58 +0200, Martin Storsjö wrote:
+ @ Store the transposed 4x4 blocks horizontally.
+ @ The first 4x4 block is kept in registers for the second pass,
+ @ store the rest in the temp buffer.
+ add r0, r0, #8
+.irp i, 20, 24, 28
+ vst1.16 {d\i}, [r0,:64]!
+.endr
just writing the .irp out would use the same number of lines and is imho
clearer, the same apply below
Hmm, indeed. This also exists in the current code, so I sent a separate
patch fixing that up.
+ beq 1f
+ mov r12, #32
+ @ Only load the top 4 lines, and only do it for the later slices.
+ @ For the first slice, d16-d19 is kept in registers from the first
pass.
the comment is oddly placed, please move it above the cmp #
Indeed, fixed.
+function idct16x16_quarter_add_neon
+ push {r4-r9,lr}
+
+ @ Align the stack, allocate a temp buffer
+T mov r7, sp
+T and r7, r7, #15
+A and r7, sp, #15
+ add r7, r7, #512
+ sub sp, sp, r7
+
+ mov r4, r0
+ mov r5, r1
+ mov r6, r2
+
+ movrel r12, idct_coeffs
+ vld1.16 {q0-q1}, [r12,:128]
+
+.irp i, 0
not needed, unless idct16x16_half_add and idct16x16_quarter_add are
templated using a macro. The first .irp is still annoying though.
I would probably use a '.ifc \size, quarter ...; .else ...; .endif'
Has the full idct the same format?
Yes, pretty much, but the full idct also can have iadst and has got a
bunch of other details (e.g. skipping steps in the first pass when
checking eob). Templating in this over there would IMO make it too messy,
but it's pretty easy to do both quarter and half with one template.
I'm not really sure which variant I prefer. Is the speed difference
mesuable for idct heavy real world samples? If you have preference for
one or the other variant I trust your judgement.
It's measurable, but it's not much. For one sample, I originally got a
full decode time like this (fastest time out of 2 runs) with the current
master:
user 2m53.980s
Alternative 1:
user 2m53.448s
Alternative 2:
user 2m52.952s
So alternative 2 is better, but produces a couple KB bigger binaries, and
more duplicated code. (OTOH also allowing more exact special casing of
minor details.)
I originally clearly preferred alt 2, but with your suggestions for alt 1,
the diff for that one ends up very small and neat.
I'll think some more about it and see which one I prefer. Perhaps alt 1
still is the best tradeoff between code duplication, diff size and speed.
// Martin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel