Hi everyone,

I'm playing around with some node-level optimisations because I've noticed that some Int64 operations on 32-bit platforms can be sped up if the arithmetic is changed slightly.  For example, when performing x + x with Int64s, it's faster to do x shl 1, especially if x is on the stack.

On i386, the instructions boil down to, for example, "addl %eax,%eax" and "shll $1,%eax", or maybe "movl (ref),%edx; leal (%edx,%edx),%eax" versus "movl (ref),%eax; shll $1,%eax".  In terms of instruction size, both "addl %eax,%eax" and "shll $1,%eax" instructions take 2 bytes to encode.  In more complex situations, shift instructions seem to be friendlier for the peephole optimizer - for example, in sfpu128:

    movl    12(%ebp),%edx
    leal    (%edx,%edx),%eax
    testl   %eax,%eax
    seteb   %al

With a shift instead of an add:

    movl    12(%ebp),%eax
    shll    $1,%eax
    seteb   %al

Of course, with the first example, the peephole optimizer or the code generator could be more efficient since %edx is not reused and it could just be encoded as "movl 12(%ebp),%edx; addl %eax,%eax; seteb %al", since addl sets the flags and hence the testl instruction can be dropped.  Where a 64-bit ordinal is concerned though, take a look at this snippet when a 64-bit integer is added to itself:

    movl    8(%ebp),%edx
    movl    12(%ebp),%eax
    addl    8(%ebp),%edx
    adcl    12(%ebp),%eax

And with a shift:

    movl    8(%ebp),%edx
    movl    12(%ebp),%eax
    shldl   $1,%edx,%eax
    shll    $1,%edx

In this case, we don't have to read from the stack twice and we also drop dependency on the carry flag, which might permit some stronger optimisations if, say, it's known that the upper 32 bits are equal to zero, and even if the value of %edx is not known, the dependency chain is broken as long as register aliasing is used to prevent the write-after-read pipeline stall between "shldl $1,%edx,%eax" and "shll %1,%edx", something which cannot be done between "addl 8(%ebp),%edx" and "adcl 12(%ebp),%eax" because of the carry flag depending on the result of "addl".

However, in general, when dealing with ordinals that fit into a CPU word, is it better to perform x + x or x shl 1? (I'm assuming x * 2 is generally worse).  Not just for i386, but also for ARM-32, for example.  (I'm also assuming that range and overflow checking is turned off - granted, shl does set the carry flag if the last bit that gets shifted out was a 1, so "shll $1,%eax" and "addl %eax,%eax" both set the carry flag if the most significant bit of %eax is set, but this is only for i386).

Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to