A small test case is attached to reproduce it. Here are logs for different loop header alignments (the default is 64-byte):

linaro@Linaro-test:~$ gcc test1.c -o t.exe && time ./t.exe

real    0m3.206s
user    0m3.203s
sys     0m0.000s

linaro@Linaro-test:~$ gcc test1.c -DALIGNED_2 -o t.exe && time ./t.exe

real    0m2.898s
user    0m2.875s
sys     0m0.016s

linaro@Linaro-test:~$ gcc test1.c -DALIGNED_4 -o t.exe && time ./t.exe

real    0m2.851s
user    0m2.844s
sys     0m0.008s

linaro@Linaro-test:~$ gcc test1.c -DALIGNED_8 -o t.exe && time ./t.exe

real    0m3.167s
user    0m3.156s
sys     0m0.000s

Thanks!
-Zhenqiang

On 23 August 2012 10:09, Michael Hope <michael.h...@linaro.org> wrote:
> Zhenqiang's been working on the later split 2 patch, which causes more
> constants to be built using a movw/movt pair instead of a constant pool
> load. There was an unexpected ~10% regression in one benchmark which
> seems to be due to function alignment. I think we've tracked down the
> reason but not the action.
>
> Compared to the baseline, the split2 branch took 113% of the time to
> run, i.e. 13% longer. Adding an explicit 16 byte alignment to the
> function changed this to 97% of the time, i.e. 3% faster. The
> reason Zhenqiang and I got different results was the build-id. He
> used the binary build scripts to make the cross compiler, which turn
> on the build ID; that added an extra 20 bytes ahead of .text, which
> happened to align the function to 16 bytes. cbuild doesn't use the
> build-id (although it should), which happened to align the function to
> an 8 byte boundary.
>
> The disassembly is identical, so I assume the regression is cache or
> fast-loop related. I'm not sure what to do, so let's talk about this
> at the next performance call.
>
> -- Michael
>
> _______________________________________________
> linaro-toolchain mailing list
> linaro-toolchain@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[Attachment: test1.asm (binary data)]
volatile float a, b, c;

int __attribute__ ((aligned(64))) main()
{
    int i, j;
    for (j = 0; j < 4; j++) {
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        /* loop header is 64-byte aligned here */
#ifdef ALIGNED_2
        __asm__ ("nop");
#endif
#ifdef ALIGNED_4
        __asm__ ("nop");
        __asm__ ("nop");
#endif
#ifdef ALIGNED_8
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
        __asm__ ("nop");
#endif
        for (i = j; i < 65535000; i++) {
            a = b + c;
        }
    }
    return 0;
}