On Fri, 3 Feb 2017, Janne Grunau wrote:
>On 2016-12-01 11:26:57 +0200, Martin Storsjö wrote:
>>This work is sponsored by, and copyright, Google.
>>
>>@@ -668,13 +756,40 @@ function \txfm\()16_1d_4x16_pass1_neon
>>
>> mov r12, #32
>> vmov.s16 q2, #0
>>+
>>+.ifc \txfm,idct
>>+ cmp r3, #10
>>+ ble 3f
>>+ cmp r3, #38
>>+ ble 4f
>>+.endif
>
>I'd test only for less or equal 38 here
>
>>+
>> .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
>> vld1.16 {d\i}, [r2,:64]
>> vst1.16 {d4}, [r2,:64], r12
>> .endr
>>
>> bl \txfm\()16
>>+.ifc \txfm,idct
>>+ b 5f
>
>cmp r3, #10
>
>>+
>>+3:
>>+.irp i, 16, 17, 18, 19
>>+ vld1.16 {d\i}, [r2,:64]
>>+ vst1.16 {d4}, [r2,:64], r12
>>+.endr
>>+ bl idct16_quarter
>>+ b 5f
>
>remove this
>
>>+
>>+4:
>>+.irp i, 16, 17, 18, 19, 20, 21, 22, 23
>>+ vld1.16 {d\i}, [r2,:64]
>>+ vst1.16 {d4}, [r2,:64], r12
>
>.if \i == 19
>blle idct16_half
>ble 5f
>.endif
>
>saves a little binary space not sure if it's worth it.
Hmm, that looks pretty neat.
I folded this change into the aarch64 version as well (along with using
rshrn instead of mov). Since aarch64 has no conditional bl, I used a b.gt
instead, like this:
.if \i == 19
b.gt 4f
bl idct16_quarter
b 5f
4:
.endif
In principle I guess one could interleave the same in the full loop as well,
having only one loop, with special case checks for i == 19 and i == 23. Then
we'd end up with two comparisons instead of one when doing the full case -
not sure if that's preferable or not.
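
For reference, an untested sketch of what that single loop could look like
in the aarch64 version. The register assignments are assumptions for the
sake of the example (eob in w3, coefficient pointer in x2, stride in x9,
v2 zeroed beforehand), and idct16_half is assumed to exist alongside
idct16_quarter:

```asm
        cmp             w3,  #10
.irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
        // Load one row of coefficients and clear it in memory; neither
        // instruction touches the flags set by the preceding cmp.
        ld1             {v\i\().8h}, [x2]
        st1             {v2.8h},     [x2], x9
.if \i == 19
        b.gt            4f              // eob > 10: keep loading
        bl              idct16_quarter
        b               5f
4:
        cmp             w3,  #38        // second comparison, for the half case
.endif
.if \i == 23
        b.gt            4f              // eob > 38: keep loading
        bl              idct16_half
        b               5f
4:
.endif
.endr
        bl              idct16          // full case
5:
```

The full case executes both cmp instructions here, which is the extra
comparison mentioned above; the quarter case still only needs the first one.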