[9fans] when is a branch not a branch?

ron minnich Mon, 02 Jun 2008 15:02:28 -0700

Here's an interesting thing I'm learning about the kind of
optimizations that might be in the EU of the newer machines. This
started out as a pretty simple 'measure the branch penalty' exercise.


Given this program:

int b(int i){
        return i+1;
}

main(){
        int i;
        for(i = 0; i < 1000000000; i = b(i));
        if (i != 1000000000)
                printf("Fix me\n");
}

The assembly code on newer micros looks like this:
       .file   "sa.c"
        .text
.globl b
        .type   b, @function
b:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        addl    $1, %eax
        popl    %ebp
        ret
        .size   b, .-b
        .section        .rodata
.LC0:
        .string "Fix me"
        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $20, %esp
        movl    $0, -8(%ebp)
        jmp     .L4
.L5:
        movl    -8(%ebp), %eax
        movl    %eax, (%esp)
        call    b
        movl    %eax, -8(%ebp)
.L4:
        cmpl    $999999999, -8(%ebp)
        jle     .L5
        cmpl    $1000000000, -8(%ebp)
        je      .L10
        movl    $.LC0, (%esp)
        call    puts
.L10:
        addl    $20, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .ident  "GCC: (GNU) 4.1.2 20070925 (Red Hat 4.1.2-27)"
        .section        .note.GNU-stack,"",@progbits

OK, let's modify b: a little as follows;
b:
#if JMP > 0
        jmp     bb
#if JMP > 1
                call    b
#if JMP > 2
                call    b
#if JMP > 3
                call    b
#endif
#endif
#endif
#endif
bb:

So, simple: if JMP is 0, no change, if JMP is 1, we do (in essence) br
.+2, if JMP is 2, we do br .+7, etc.

now I time the run 10 times (I can run longer but it seems good enough
to establish behavior). I should get some rough idea of the cost of
the branch.

Attachment 1 shows this cpu running in 32-bit mode:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.40GHz
stepping        : 4
cpu MHz         : 3415.468
cache size      : 1024 KB

the short form: the penalty for the branch is zero. As I say, I can
run it until I get getter statistics, but the trend is clear. Note
this was an unloaded machine.

Well, how about my laptop? The penalty for the branch is negative. Our
guess: the very highly dynamic power management is playing tricks with
our minds. But we don't know. But the code with the branch is
consistently 25% faster.
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping        : 6
cpu MHz         : 1000.000
cache size      : 4096 KB

how about a 64-bit cpu?

cpu family      : 15
model           : 3
model name      :                   Intel(R) Xeon(TM) CPU 3.40GHz
stepping        : 4
cpu MHz         : 3400.283
cache size      : 1024 KB

third graph. Here you can see the penalty increased a bit as the size
of the jmp increased.

Don't try this with 8a. 8a is too damn smart -- it just optimizes the
branch out (unless there is a switch to avoid doing that). You need a
dumb assembler.
ron

<<attachment: prism.gif>>

<<attachment: results.gif>>

<<attachment: runtimes.64bit.gif>>

[9fans] when is a branch not a branch?

Reply via email to