I suppose in truth, I can, and that in itself is probably fairly cross-platform (although I'll stick with x86 for the moment and get that working).  Sometimes the simple solution eludes me!  Is there anything I need to take into account when it comes to range checking (that is, if a third party tries to compile a unit with range checking enabled), since "numerator * $AAAAAAAB" when constrained to 32 bits will almost always overflow?

Gareth aka. Kit

On 10/09/2021 20:53, Florian Klämpfl via fpc-devel wrote:
Am 10.09.21 um 21:17 schrieb J. Gareth Moreton via fpc-devel:
Hi everyone,

I'm looking at ways to optimise div and mod, starting with x86 and then probably AArch64.  The obvious one is attempting to merge "Q := N div D; R := N mod D;", where D is a variable (but invariant between the two instructions), since DIV returns the quotient in R/EAX and the remainder in R/EDX in a single operation, or converting the latter equation to "R := N - (Q * D);" if D is a constant.

However, inspired somewhat by "Hacker's Delight", I would like to first see if I can optimise the Boolean condition "(X mod C) = 0", where C is a constant.  By calculating the multiplicative reciprocal of C (it may or may not be equal to the 'magic div' constant), you can perform it with just a multiplication and a comparison - for example, when dividing by 3 and returning the remainder:

mov (numerator),%reg1
mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
imul %reg1,%reg2
cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }

If %reg2 is less than or equal to $55555555, then the numerator is an exact multiple of 3, and if it's greater, then it is not an exact multiple.  The proof for this is explained in Hacker's Delight, but relies on the fact that 3 and 2^32 are relatively prime and the exact multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the values 0 to $55555555 (if the divisor is even, which means it's not relatively prime to 2^32, you have to do a bit of trickery with a bit rotation, but done properly, it's only 1 extra instruction).

I'm trying to think of a way to make this clean and flexible, especially where future expansion is concerned.  One idea I had was to create a new platform-specific node such as "tx86divisible", which takes an integer variable (x) and an integer constant (c) and returns True if x mod c = 0, and "(X mod C) = 0" code is converted to this node via tx86addnode.simplify (the node used for comparisons), so it can be quickly converted into the optimal code in pass_generate_code.  The other option is to do this conversion in pass_generate_code, where a new node type isn't required but might be a little trickier to make cross-platform... if it's possible to make "tx86divisible" completely cross-platform - that is, have an implementation on every target - the node conversion code only has to exist in a single place, thus improving maintainability.

What do you suggest?

Can't you generate a mul and cmp node in tx86addnode.simplify which simulates this behavior?
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to