Hi everyone,

I've noticed a bit of a deep potential optimisation for faster code:

    jne     .Lj1806
    movb    $13,-40(%rbp)
    cmpq    $0,-32(%rbp)
    je      .Lj1798
    ...
.Lj1806:
    movb    $1,-40(%rbp)
.Lj1798:
    cmpb    $1,-40(%rbp)
    jne     .Lj1812

If you analyse the jumps and stack values carefully (which could be considered registers for sake of simplicity), you can make an assumption and shortcut a jump in the program flow:

- Prior to "je .Lj1798", the value of -40(%rbp) is set to 13.
- If the program flow jumps here, "cmp $1,-40(%rbp)" sets flags such that "jne .Lj1812" will definitely branch, because -40(%rbp) is equal to 13 and not 1. - Therefore, the original "je .Lj1798" branch can be redirected to "je .Lj1812"

In other words:

    jne     .Lj1806
    movb    $13,-40(%rbp)
    cmpq    $0,-32(%rbp)
    je      .Lj1812
    ...
.Lj1806:
    movb    $1,-40(%rbp)
.Lj1798:
    cmpb    $1,-40(%rbp)
    jne     .Lj1812

This block behaves exactly the same way, but we've removed a comparison and a conditional jump from the program flow, which will almost certainly result in a speed increase due to fewer instructions being executed and less pressure on the branch predictor.

The question and challenge here though... how could one implement this kind of assembly-level data-flow analysis? Stopping at "je .Lj1812" jumping to the label and then looking ahead to see "cmpb $1,-40(%rbp); jne .Lj1812" is expensive enough, but then also looking backwards to find a matching MOV instruction is even more expensive.  Similarly, looking forward from "movb $13,-40(%rbp)" and analysing every conditional jump you come across is rather prohibitive too.  If that is hard enough, look at the block of code from aasmcpu.s with the ... filled in:

    movb    $13,-40(%rbp)
    cmpq    $0,-32(%rbp)
    je      .Lj1798
    movq    -16(%rbp),%rcx
    call    AASMCPU$_$TAICPU_$__$$_INSEND$$LONGINT
    movslq  %eax,%rax
    movq    -8(%rbp),%rdx
    movq    24(%rdx),%rdx
    subq    56(%rdx),%rax
    movzbl  -63(%rbp),%edx
    subq    %rdx,%rax
    subq    %rax,-24(%rbp)
    jmp     .Lj1798
    .p2align 4,,10
    .p2align 3
.Lj1806:
    movb    $1,-40(%rbp)
.Lj1798:
    cmpb    $1,-40(%rbp)
    jne     .Lj1812

(Note that the "cmpb $1,-40(%rbp)" instruction is caused by an optimisation that I'm currently developing and won't appear as such yet - it currently appears as "movzbl -40(%rbp),%eax; cmpl $1,%eax")

The same logic applies to "jmp .Lj1798", in that -40(%rbp) is still equal to 13 and can be redirected to "jmp .Lj1812" (assuming AASMCPU$_$TAICPU_$__$$_INSEND$$LONGINT isn't malformed).  In fact, for this particular procedure, every jump to .Lj1798 sets -40(%rbp) to something other than 1 almost immediately prior; it only gets set to 1 immediately after label .Lj1806, so logically, .Lj1798 could become a dead label, and then "movb $1,-40(%rbp); cmpb $1,-40(%rbp)" are now adjacent and deterministic, meaning the "jne .Lj1812" will never branch and can be removed.  But putting that aside for the moment, how can one track the value of -40(%rbp) through all those instructions to the jump, especially through a call?

Of course, it could probably be brute-forced, but that could easily be on the order of O(n²) if not worse.  Some conservative assumptions can be made to terminate a search early, like if %rbp or %rsp (whichever is applicable) changes value or a write to general memory occurs (unless it's the stack), since the register that's dereferenced could easily be made to equal the stack pointer.

Let theorycrafting begin!

Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to