So I've got some pretty good headway so far!

Trunk:

              Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.757 ns                 Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns              Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns                Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 6.461 ns             Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.931 ns               Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 6.286 ns             Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.990 ns           Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 1.048 ns               Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.698 ns                 Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns              Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.757 ns            Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 6.403 ns             Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.990 ns        Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 6.286 ns   Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.990 ns

New algorithm:

              Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.524 ns                 Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns              Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns                Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns             Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns               Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns             Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.582 ns           Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 0.582 ns               Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns                 Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.815 ns              Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns            Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 0.873 ns             Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.815 ns        Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 0.873 ns   Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.757 ns

I tend to shave off a few fractions of a nanosecond for unsigned modulus operations, while signed modulus operations, which still use IDIV internally due to some awkwardness with how moduli are calculated (which still needs to be resolved), the saving is absolutely massive.

64-bit is a little slower than 32-bit possibly because the code size is quite large due to there being 3 different 64-bit constants that need to be loaded (for 32-bit and under, these constants can be directly encoded in the individual instructions as immediates).  Since the 3rd constant is just the 2nd one having been bit-shifted, a better approach would be to temporarily store the second constant in a register and then shift it at the same time as another mathematical operation (thus using two ALU ports to execute them simultaneously).  As specified in Hacker's Delight, this is also the recommended approach for RISC processors such as AArch64 where encoding 64-bit constants takes up to 4 instructions.  This, however, would require the use of a tempref, something I'm still researching.

Under x86_64, the use of a shift instead of a load could be done using a peephole optimisation.  For example, the code generated for signed 64-bit "(n mod 3) = 0" is currently:

    movq    %rax,%r8
    movq    $-6148914691236517205,%r11
    imulq    %r11,%r8
    movq    $3074457345618258602,%r11
    addq    %r11,%r8
    movq    $6148914691236517204,%r11
    cmpq    %r11,%r8

With a small addition to DeepMOVOpt, the peephole optimizer could easily spot that 6148914691236517204 is exactly double 3074457345618258602, and change the second mov instruction to "shlq $1,%r11", which only requires 3 bytes to store (4 if the shift is something other than 1), compared to "movq $6148914691236517204,%r11" which requires 10.  As long as the original value is used at some point (in this case, via "addq %r11,%r8"), it takes the same number of cycles to execute.

I'm waiting until my last division patches are uploaded before opening an issue because I'm adding quite a few new tests to tests/bench/bdiv and I want to minimise merge conflicts.  That and I still need to test how range checks affect the compilation, since the internal multiplications deliberately overflow.

Gareth aka. Kit


On 10/09/2021 21:59, J. Gareth Moreton via fpc-devel wrote:
I suppose in truth, I can, and that in itself is probably fairly cross-platform (although I'll stick with x86 for the moment and get that working).  Sometimes the simple solution eludes me!  Is there anything I need to take into account when it comes to range checking (that is, if a third party tries to compile a unit with range checking enabled), since "numerator * $AAAAAAAB" when constrained to 32 bits will almost always overflow?

Gareth aka. Kit

On 10/09/2021 20:53, Florian Klämpfl via fpc-devel wrote:
Am 10.09.21 um 21:17 schrieb J. Gareth Moreton via fpc-devel:
Hi everyone,

I'm looking at ways to optimise div and mod, starting with x86 and then probably AArch64.  The obvious one is attempting to merge "Q := N div D; R := N mod D;", where D is a variable (but invariant between the two instructions), since DIV returns the quotient in R/EAX and the remainder in R/EDX in a single operation, or converting the latter equation to "R := N - (Q * D);" if D is a constant.

However, inspired somewhat by "Hacker's Delight", I would like to first see if I can optimise the Boolean condition "(X mod C) = 0", where C is a constant.  By calculating the multiplicative reciprocal of C (it may or may not be equal to the 'magic div' constant), you can perform it with just a multiplication and a comparison - for example, when dividing by 3 and returning the remainder:

mov (numerator),%reg1
mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
imul %reg1,%reg2
cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }

If %reg2 is less than or equal to $55555555, then the numerator is an exact multiple of 3, and if it's greater, then it is not an exact multiple.  The proof for this is explained in Hacker's Delight, but relies on the fact that 3 and 2^32 are relatively prime and the exact multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the values 0 to $55555555 (if the divisor is even, which means it's not relatively prime to 2^32, you have to do a bit of trickery with a bit rotation, but done properly, it's only 1 extra instruction).

I'm trying to think of a way to make this clean and flexible, especially where future expansion is concerned.  One idea I had was to create a new platform-specific node such as "tx86divisible", which takes an integer variable (x) and an integer constant (c) and returns True if x mod c = 0, and "(X mod C) = 0" code is converted to this node via tx86addnode.simplify (the node used for comparisons), so it can be quickly converted into the optimal code in pass_generate_code.  The other option is to do this conversion in pass_generate_code, where a new node type isn't required but might be a little trickier to make cross-platform... if it's possible to make "tx86divisible" completely cross-platform - that is, have an implementation on every target - the node conversion code only has to exist in a single place, thus improving maintainability.

What do you suggest?

Can't you generate a mul and cmp node in tx86addnode.simplify which simulates this behavior?
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to