Re: [fpc-devel] Division optimisations

J. Gareth Moreton via fpc-devel Sat, 11 Sep 2021 22:22:45 -0700

So I've got some pretty good headway so far!

Trunk:

Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.757 ns
Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns
Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns
Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 6.461 ns
Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.931 ns
Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 6.286 ns
Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.990 ns
Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 1.048 ns
Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.698 ns
Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns
Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.757 ns
Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 6.403 ns
Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.990 ns
Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 6.286 ns
Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.990 ns


New algorithm:

Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.524 ns
Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns
Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns
Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns
Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns
Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns
Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.582 ns
Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 0.582 ns
Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns
Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.815 ns
Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns
Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 0.873 ns
Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.815 ns
Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 0.873 ns
Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.757 ns

I tend to shave off a few fractions of a nanosecond for unsigned modulus operations, while for signed modulus operations, which still use IDIV internally due to some awkwardness with how moduli are calculated (which still needs to be resolved), the saving is absolutely massive.

64-bit is a little slower than 32-bit, possibly because the code size is quite large due to there being 3 different 64-bit constants that need to be loaded (for 32-bit and under, these constants can be directly encoded in the individual instructions as immediates). Since the 3rd constant is just the 2nd one bit-shifted, a better approach would be to temporarily store the second constant in a register and then shift it at the same time as another mathematical operation (thus using two ALU ports to execute them simultaneously). As specified in Hacker's Delight, this is also the recommended approach for RISC processors such as AArch64, where encoding a 64-bit constant takes up to 4 instructions. This, however, would require the use of a tempref, something I'm still researching.

Under x86_64, the use of a shift instead of a load could be done using a peephole optimisation. For example, the code generated for signed 64-bit "(n mod 3) = 0" is currently:


    movq    %rax,%r8
    movq    $-6148914691236517205,%r11
    imulq    %r11,%r8
    movq    $3074457345618258602,%r11
    addq    %r11,%r8
    movq    $6148914691236517204,%r11
    cmpq    %r11,%r8

With a small addition to DeepMOVOpt, the peephole optimizer could easily spot that 6148914691236517204 is exactly double 3074457345618258602, and change the final mov instruction to "shlq $1,%r11", which only requires 3 bytes to store (4 if the shift is something other than 1), compared to "movq $6148914691236517204,%r11", which requires 10. As long as the original value is used at some point (in this case, via "addq %r11,%r8"), it takes the same number of cycles to execute.
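
As a rough illustration of what the listing above computes, here is a minimal Python sketch of the signed 64-bit "(n mod 3) = 0" check. It is not FPC code. The names MASK64, INV3, BIAS, LIMIT and signed64_divisible_by_3 are made up for this example, and it assumes the cmpq is followed by an unsigned "below or equal" test, which the listing does not show.

```python
MASK64 = (1 << 64) - 1

# The three constants from the listing, for divisor 3.
INV3 = -6148914691236517205 & MASK64  # $AAAAAAAAAAAAAAAB: 3 * INV3 = 1 (mod 2^64)
BIAS = 3074457345618258602            # (2^63 - 1) div 3
LIMIT = BIAS << 1                     # the third constant is the second shifted left by 1


def signed64_divisible_by_3(n: int) -> bool:
    """Mimic imulq / addq / cmpq for a signed 64-bit n (unsigned compare assumed)."""
    t = (n * INV3) & MASK64   # imulq: keep only the low 64 bits of the product
    t = (t + BIAS) & MASK64   # addq: wraps around modulo 2^64
    return t <= LIMIT         # cmpq, then an unsigned "below or equal"


if __name__ == "__main__":
    for n in (-9, -4, 0, 7, 9, (1 << 63) - 1):
        print(n, signed64_divisible_by_3(n))
```

Because LIMIT is derived from BIAS with a single shift, the sketch also shows why the third 64-bit load in the listing carries no extra information.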

I'm waiting until my last division patches are uploaded before opening an issue because I'm adding quite a few new tests to tests/bench/bdiv and I want to minimise merge conflicts. That and I still need to test how range checks affect the compilation, since the internal multiplications deliberately overflow.


Gareth aka. Kit


On 10/09/2021 21:59, J. Gareth Moreton via fpc-devel wrote:

I suppose in truth, I can, and that in itself is probably fairly cross-platform (although I'll stick with x86 for the moment and get that working). Sometimes the simple solution eludes me! Is there anything I need to take into account when it comes to range checking (that is, if a third party tries to compile a unit with range checking enabled), since "numerator * $AAAAAAAB" when constrained to 32 bits will almost always overflow?
Gareth aka. Kit

On 10/09/2021 20:53, Florian Klämpfl via fpc-devel wrote:
On 10.09.21 at 21:17, J. Gareth Moreton via fpc-devel wrote:
Hi everyone,
I'm looking at ways to optimise div and mod, starting with x86 and then probably AArch64. The obvious one is attempting to merge "Q := N div D; R := N mod D;", where D is a variable (but invariant between the two instructions), since DIV returns the quotient in R/EAX and the remainder in R/EDX in a single operation, or converting the latter equation to "R := N - (Q * D);" if D is a constant.
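
The identity behind the "R := N - (Q * D);" rewrite can be sketched in a few lines of Python (an illustration only, not compiler code). The helper name trunc_divmod is hypothetical. It emulates Pascal's div, which truncates toward zero, because Python's own // operator rounds toward negative infinity instead.

```python
def trunc_divmod(n: int, d: int) -> tuple[int, int]:
    """Return (n div d, n mod d) with truncation toward zero, using one division."""
    q = abs(n) // abs(d)          # the single division
    if (n < 0) != (d < 0):
        q = -q                    # truncate toward zero rather than flooring
    r = n - q * d                 # remainder recovered by multiply and subtract
    return q, r


if __name__ == "__main__":
    print(trunc_divmod(17, 5))
    print(trunc_divmod(-7, 2))
```

With this definition the remainder takes the sign of the dividend, which is the behaviour the multiply-and-subtract form has to preserve.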
However, inspired somewhat by "Hacker's Delight", I would like to first see if I can optimise the Boolean condition "(X mod C) = 0", where C is a constant. By calculating the multiplicative reciprocal of C (it may or may not be equal to the 'magic div' constant), you can perform it with just a multiplication and a comparison - for example, when checking whether the remainder of a division by 3 is zero:
mov (numerator),%reg1
mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
imul %reg1,%reg2
cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
If %reg2 is less than or equal to $55555555 (compared as unsigned), then the numerator is an exact multiple of 3, and if it's greater, then it is not an exact multiple. The proof for this is explained in Hacker's Delight, but relies on the fact that 3 and 2^32 are relatively prime and the exact multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the values 0 to $55555555 (if the divisor is even, which means it's not relatively prime to 2^32, you have to do a bit of trickery with a bit rotation, but done properly, it's only 1 extra instruction).
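
The test described above, including the rotation used for even divisors, can be sketched in Python as follows. This is an illustration under stated assumptions, not the code FPC generates: the names ror32 and make_divisibility_test are invented here, and the even-divisor handling follows the rotate-right variant from Hacker's Delight as I understand it. The three-argument pow with a negative exponent needs Python 3.8 or later.

```python
MASK32 = 0xFFFFFFFF


def ror32(x: int, k: int) -> int:
    """Rotate a 32-bit value right by k bits."""
    k %= 32
    return ((x >> k) | (x << (32 - k))) & MASK32


def make_divisibility_test(c: int):
    """Build an unsigned 32-bit test for (n mod c) = 0, for a constant 1 <= c < 2^32."""
    shift = (c & -c).bit_length() - 1   # number of trailing zero bits in c
    odd = c >> shift                    # odd part of c, coprime to 2^32
    inverse = pow(odd, -1, 1 << 32)     # multiplicative inverse modulo 2^32
    limit = MASK32 // c                 # (2^32 - 1) div c

    def test(n: int) -> bool:
        t = (n * inverse) & MASK32      # the wrapping 32-bit multiplication
        return ror32(t, shift) <= limit # rotate is a no-op when c is odd

    return test


if __name__ == "__main__":
    div3 = make_divisibility_test(3)    # inverse $AAAAAAAB, limit $55555555
    div10 = make_divisibility_test(10)
    print(div3(9), div3(10), div10(100), div10(105))
```

For an odd divisor the rotation count is zero, so the test reduces to the multiply and compare shown in the assembly above.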
I'm trying to think of a way to make this clean and flexible, especially where future expansion is concerned. One idea I had was to create a new platform-specific node such as "tx86divisible", which takes an integer variable (x) and an integer constant (c) and returns True if x mod c = 0, and "(X mod C) = 0" code is converted to this node via tx86addnode.simplify (the node used for comparisons), so it can be quickly converted into the optimal code in pass_generate_code. The other option is to do this conversion in pass_generate_code, where a new node type isn't required but might be a little trickier to make cross-platform... if it's possible to make "tx86divisible" completely cross-platform - that is, have an implementation on every target - the node conversion code only has to exist in a single place, thus improving maintainability.
What do you suggest?
Can't you generate a mul and cmp node in tx86addnode.simplify which simulates this behavior?
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

