[fpc-devel] Ordinal optimisation question

J. Gareth Moreton via fpc-devel Wed, 01 Dec 2021 21:36:26 -0800

Hi everyone,

I'm playing around with some node-level optimisations because I'venoticed that some Int64 operations on 32-bit platforms can be sped up ifthe arithmetic is changed slightly. For example, when performing x + xwith Int64s, it's faster to do x shl 1, especially if x is on the stack.

On i386, the instructions boil down to, for example, "addl %eax,%eax"and "shll $1,%eax", or maybe "movl (ref),%edx; leal (%edx,%edx),%eax"versus "movl (ref),%eax; shll $1,%eax". In terms of instruction size,both "addl %eax,%eax" and "shll $1,%eax" instructions take 2 bytes toencode. In more complex situations, shift instructions seem to befriendlier for the peephole optimizer - for example, in sfpu128:


    movl    12(%ebp),%edx
    leal    (%edx,%edx),%eax
    testl   %eax,%eax
    seteb   %al

With a shift instead of an add:

    movl    12(%ebp),%eax
    shll    $1,%eax
    seteb   %al

Of course, with the first example, the peephole optimizer or the codegenerator could be more efficient since %edx is not reused and it couldjust be encoded as "movl 12(%ebp),%edx; addl %eax,%eax; seteb %al",since addl sets the flags and hence the testl instruction can bedropped. Where a 64-bit ordinal is concerned though, take a look atthis snippet when a 64-bit integer is added to itself:


    movl    8(%ebp),%edx
    movl    12(%ebp),%eax
    addl    8(%ebp),%edx
    adcl    12(%ebp),%eax

And with a shift:

    movl    8(%ebp),%edx
    movl    12(%ebp),%eax
    shldl   $1,%edx,%eax
    shll    $1,%edx

In this case, we don't have to read from the stack twice and we alsodrop dependency on the carry flag, which might permit some strongeroptimisations if, say, it's known that the upper 32 bits are equal tozero, and even if the value of %edx is not known, the dependency chainis broken as long as register aliasing is used to prevent thewrite-after-read pipeline stall between "shldl $1,%edx,%eax" and "shll%1,%edx", something which cannot be done between "addl 8(%ebp),%edx" and"adcl 12(%ebp),%eax" because of the carry flag depending on the resultof "addl".

However, in general, when dealing with ordinals that fit into a CPUword, is it better to perform x + x or x shl 1? (I'm assuming x * 2 isgenerally worse). Not just for i386, but also for ARM-32, for example. (I'm also assuming that range and overflow checking is turned off -granted, shl does set the carry flag if the last bit that gets shiftedout was a 1, so "shll $1,%eax" and "addl %eax,%eax" both set the carryflag if the most significant bit of %eax is set, but this is only for i386).


Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

[fpc-devel] Ordinal optimisation question

Reply via email to