Here's an interesting thing I'm learning about the kind of optimizations that might be in the EU of the newer machines. This started out as a pretty simple 'measure the branch penalty' exercise.
Given this program:
#include <stdio.h>

int b(int i){
	return i+1;
}
int main(void){
	int i;
	for(i = 0; i < 1000000000; i = b(i));
	if (i != 1000000000)
		printf("Fix me\n");
}
The assembly code on newer micros looks like this:
	.file "sa.c"
	.text
.globl b
	.type b, @function
b:
	pushl %ebp
	movl %esp, %ebp
	movl 8(%ebp), %eax
	addl $1, %eax
	popl %ebp
	ret
	.size b, .-b
	.section .rodata
.LC0:
	.string "Fix me"
	.text
.globl main
	.type main, @function
main:
	leal 4(%esp), %ecx
	andl $-16, %esp
	pushl -4(%ecx)
	pushl %ebp
	movl %esp, %ebp
	pushl %ecx
	subl $20, %esp
	movl $0, -8(%ebp)
	jmp .L4
.L5:
	movl -8(%ebp), %eax
	movl %eax, (%esp)
	call b
	movl %eax, -8(%ebp)
.L4:
	cmpl $999999999, -8(%ebp)
	jle .L5
	cmpl $1000000000, -8(%ebp)
	je .L10
	movl $.LC0, (%esp)
	call puts
.L10:
	addl $20, %esp
	popl %ecx
	popl %ebp
	leal -4(%ecx), %esp
	ret
	.size main, .-main
	.ident "GCC: (GNU) 4.1.2 20070925 (Red Hat 4.1.2-27)"
	.section .note.GNU-stack,"",@progbits
OK, let's modify b: a little, as follows:
b:
#if JMP > 0
	jmp bb
#if JMP > 1
	call b
#if JMP > 2
	call b
#if JMP > 3
	call b
#endif
#endif
#endif
#endif
bb:
So, simple: if JMP is 0, no change; if JMP is 1, we do (in essence)
br .+2; if JMP is 2, we do br .+7; etc. (The short jmp is 2 bytes and
each call it skips is 5 bytes.)
Now I time the run 10 times (I can run longer, but it seems good enough
to establish behavior). I should get some rough idea of the cost of
the branch.
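For anyone who wants to repeat the timing, here is a minimal sketch of a harness, not the one used for these numbers. It assumes a POSIX system with clock_gettime; the names elapsed_seconds, time_once, time_runs, min_of and workload are all made up for illustration, and workload is only a short stand-in for the real 1000000000-iteration loop.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Seconds elapsed between two clock readings. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
	return (double)(end.tv_sec - start.tv_sec)
	     + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical stand-in for the loop under test; volatile keeps the
 * compiler from deleting the loop. */
void workload(void)
{
	volatile int i;
	for (i = 0; i < 100000; i++)
		;
}

/* Time one call of fn with the monotonic clock. */
double time_once(void (*fn)(void))
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	fn();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return elapsed_seconds(t0, t1);
}

/* Store n timings of fn in out[0..n-1] and return their mean. */
double time_runs(void (*fn)(void), double *out, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		out[i] = time_once(fn);
		sum += out[i];
	}
	return sum / (double)n;
}

/* Smallest of n timings; the minimum is the least disturbed by
 * other activity on the machine. */
double min_of(const double *v, size_t n)
{
	double m;
	size_t i;

	assert(n > 0);
	m = v[0];
	for (i = 1; i < n; i++)
		if (v[i] < m)
			m = v[i];
	return m;
}
```

Calling time_runs(workload, out, 10) and comparing min_of(out, 10) across the JMP settings gives the same kind of comparison as the runs described here. Reporting both the minimum and the mean is a design choice: the mean shows the trend, the minimum filters out noise.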
Attachment 1 shows this cpu running in 32-bit mode:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 3.40GHz
stepping : 4
cpu MHz : 3415.468
cache size : 1024 KB
The short form: the penalty for the branch is zero. As I say, I can
run it until I get better statistics, but the trend is clear. Note
this was an unloaded machine.
Well, how about my laptop? The penalty for the branch is negative: the
code with the branch is consistently 25% faster. Our guess is that the
very highly dynamic power management is playing tricks with our minds,
but we don't know.
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping : 6
cpu MHz : 1000.000
cache size : 4096 KB
How about a 64-bit cpu?
cpu family : 15
model : 3
model name : Intel(R) Xeon(TM) CPU 3.40GHz
stepping : 4
cpu MHz : 3400.283
cache size : 1024 KB
See the third graph. Here you can see the penalty increased a bit as
the size of the jmp increased.
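To turn a difference in run time into a per-iteration penalty, the arithmetic is roughly (time with jmp - time without) * clock rate / iterations. A small sketch, with the function name penalty_cycles made up for illustration; the run times in the usage note are hypothetical, not measurements from these machines.

```c
#include <assert.h>

/* Extra cycles per loop iteration attributable to the added jmp.
 * t_base and t_jmp are run times in seconds, hz is the clock rate
 * and iters is the loop count (1000000000 in the program above).
 * A negative result means the jmp version ran faster. */
double penalty_cycles(double t_base, double t_jmp, double hz, double iters)
{
	assert(hz > 0.0 && iters > 0.0);
	return (t_jmp - t_base) * hz / iters;
}
```

For example, with hypothetical run times of 10.0 and 11.0 seconds on a 3.4GHz part, penalty_cycles(10.0, 11.0, 3.4e9, 1e9) comes out at about 3.4 cycles per iteration. This assumes the clock rate is constant for the whole run, which may not hold on a machine with aggressive power management.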
Don't try this with 8a. 8a is too damn smart -- it just optimizes the
branch out (unless there is a switch to avoid doing that). You need a
dumb assembler.
ron
<<attachment: prism.gif>>
<<attachment: results.gif>>
<<attachment: runtimes.64bit.gif>>
