On 21 November 2012 09:20, Zhenqiang Chen <zhenqiang.c...@linaro.org> wrote:
> On 21 November 2012 03:26, Michael Hope <michael.h...@linaro.org> wrote:
>> On 20 November 2012 22:10, Zhenqiang Chen <zhenqiang.c...@linaro.org> wrote:
>>> Hi,
>>>
>>> I tried ARM, MIPS, PowerPC and x86 on the povray benchmark. None of
>>> them can shrink-wrap the function Ray_In_Bound.
>>>
>>> Here is the function:
>>> bool Ray_In_Bound (RAY *Ray, OBJECT *Bounding_Object)
>>> {
>>>   ...
>>>   for (Bound = Bounding_Object; Bound != NULL; Bound = Bound->Sibling)
>>>   {...}
>>>   return (true);
>>> }
>>> For ARM at -O2/-O3, "Bound" is allocated to "r6" during IRA, so there
>>> is a copy
>>>
>>> r6 = r1
>>>
>>> before the test Bound != NULL.
>>
>> Could you hack the benchmark to make the early exit explicit and see
>> if that changes the result?  That lets us know if improving shrink
>> wrap is worthwhile.
>>
>> Something like:
>>
>>  bool Ray_In_Bound (RAY *Ray, OBJECT *Bounding_Object)
>>  {
>>   if (Bounding_Object == NULL) return true;
>
> I tried it. The result is the same as with the original code. (The
> hacked-in check is optimized away.)

After hand-editing the assembly code, I got a 2-3% performance
improvement at -O2. Here is the assembly change.
Original code:
        push    {r4, r5, r6, r7, r8, r9, lr}
        .save {r4, r5, r6, r7, r8, r9, lr}
        mov     r6, r1
        .pad #196
        sub     sp, sp, #196
        cbz     r1, .L113
        ldr     r8, .L117
        ...
.L113:
        movs    r0, #1
        add     sp, sp, #196
        @ sp needed
        pop     {r4, r5, r6, r7, r8, r9, pc}

After shrink-wrap:
        cbz     r1, .L1131
        push    {r4, r5, r6, r7, r8, r9, lr}
        .save {r4, r5, r6, r7, r8, r9, lr}
        mov     r6, r1
        .pad #196
        sub     sp, sp, #196
        ldr     r8, .L117
        ...
.L113:
        movs    r0, #1
        add     sp, sp, #196
        @ sp needed
        pop     {r4, r5, r6, r7, r8, r9, pc}
.L1131:
        movs    r0, #1
        bx      lr
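
As a rough C-level picture of what the hand-edited assembly does, the
sketch below splits the function into a frame-less NULL check and an
out-of-line slow path. This is only an illustration, not povray's code:
the OBJECT layout, the "hit" field, the helper ray_in_bound_slow and the
name Ray_In_Bound_sketch are all hypothetical, and
__attribute__((noinline)) is a GCC/Clang extension. Whether a given
compiler actually emits the fast path without a prologue is not
guaranteed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for povray's types; only what the sketch needs. */
typedef struct Ray RAY;              /* opaque, unused by the sketch */
typedef struct Object {
    struct Object *Sibling;
    int hit;                         /* hypothetical: 1 if the ray hits this bound */
} OBJECT;

/* Slow path: the loop that needs callee-saved registers and stack space.
   Kept out of line so its prologue is only paid when it is really entered. */
__attribute__((noinline))
static bool ray_in_bound_slow(RAY *Ray, OBJECT *Bounding_Object)
{
    (void)Ray;
    assert(Bounding_Object != NULL); /* the caller has already filtered NULL */
    for (OBJECT *Bound = Bounding_Object; Bound != NULL; Bound = Bound->Sibling) {
        if (!Bound->hit)
            return false;
    }
    return true;
}

/* Fast path: corresponds to "cbz r1, .L1131" and then "movs r0, #1; bx lr". */
bool Ray_In_Bound_sketch(RAY *Ray, OBJECT *Bounding_Object)
{
    if (Bounding_Object == NULL)
        return true;
    return ray_in_bound_slow(Ray, Bounding_Object);
}
```

The point of the split is the same as in the assembly: the early exit
touches no callee-saved registers, so it needs neither the push/pop nor
the stack adjustment.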

But the same simple hack at -O3 gives a ~1% regression. A change in code
alignment is probably the root cause. To verify this, I added 6 NOPs
after "bx lr", which makes block .L1131 16 bytes in size. With this
change, -O3 also shows a 2-3% performance improvement.
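
The 16-byte figure can be checked with a little arithmetic, assuming
that "movs r0, #1", "bx lr" and each NOP all use 16-bit (2-byte) Thumb
encodings. The helper names below are hypothetical and exist only for
this check.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: every instruction in the block is a 16-bit Thumb encoding. */
enum { THUMB16_BYTES = 2 };

/* Size in bytes of a block made of n_insns 16-bit Thumb instructions. */
size_t thumb16_block_bytes(size_t n_insns)
{
    return n_insns * THUMB16_BYTES;
}

/* Block .L1131 after padding: movs + bx + six NOPs = 8 instructions. */
size_t l1131_padded_bytes(void)
{
    size_t size = thumb16_block_bytes(1 + 1 + 6);
    assert(size % 16 == 0);  /* the padded block fills a whole 16-byte unit */
    return size;
}
```

Without the NOPs the block would be only 4 bytes under the same
assumption, so whatever follows it starts at a different offset.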

-Zhenqiang

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
