I suppose in truth, I can, and that in itself is probably fairly
cross-platform (although I'll stick with x86 for the moment and get that
working). Sometimes the simple solution eludes me! Is there anything I
need to take into account when it comes to range checking (that is, if a
third party tries to compile a unit with range checking enabled), since
"numerator * $AAAAAAAB" when constrained to 32 bits will almost always
overflow?
Gareth aka. Kit
On 10/09/2021 20:53, Florian Klämpfl via fpc-devel wrote:
Am 10.09.21 um 21:17 schrieb J. Gareth Moreton via fpc-devel:
Hi everyone,
I'm looking at ways to optimise div and mod, starting with x86 and
then probably AArch64. The obvious one is attempting to merge "Q :=
N div D; R := N mod D;", where D is a variable (but invariant between
the two instructions), since DIV returns the quotient in R/EAX and
the remainder in R/EDX in a single operation, or converting the
latter equation to "R := N - (Q * D);" if D is a constant.
However, inspired somewhat by "Hacker's Delight", I would like to
first see if I can optimise the Boolean condition "(X mod C) = 0",
where C is a constant. By calculating the multiplicative reciprocal
of C (it may or may not be equal to the 'magic div' constant), you
can perform it with just a multiplication and a comparison - for
example, when dividing by 3 and returning the remainder:
mov (numerator),%reg1
mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
imul %reg1,%reg2
cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
If %reg2 is less than or equal to $55555555, then the numerator is an
exact multiple of 3, and if it's greater, then it is not an exact
multiple. The proof for this is explained in Hacker's Delight, but
relies on the fact that 3 and 2^32 are relatively prime and the exact
multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the
values 0 to $55555555 (if the divisor is even, which means it's not
relatively prime to 2^32, you have to do a bit of trickery with a bit
rotation, but done properly, it's only 1 extra instruction).
I'm trying to think of a way to make this clean and flexible,
especially where future expansion is concerned. One idea I had was
to create a new platform-specific node such as "tx86divisible", which
takes an integer variable (x) and an integer constant (c) and returns
True if x mod c = 0, and "(X mod C) = 0" code is converted to this
node via tx86addnode.simplify (the node used for comparisons), so it
can be quickly converted into the optimal code in
pass_generate_code. The other option is to do this conversion in
pass_generate_code, where a new node type isn't required but might be
a little trickier to make cross-platform... if it's possible to make
"tx86divisible" completely cross-platform - that is, have an
implementation on every target - the node conversion code only has to
exist in a single place, thus improving maintainability.
What do you suggest?
Can't you generate a mul and cmp node in tx86addnode.simplify which
simulates this behavior?
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel