Re: [fpc-devel] The "magic div" algorithm

J. Gareth Moreton via fpc-devel Fri, 03 Sep 2021 13:12:50 -0700

Div by 3 has a very elegant implementation. In 32-bit, it's:


movl    $0xAAAAAAAB,%eax
mull    Input ; answer is in %edx:%eax
shrl    $1,%edx
movl    %edx, Result
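
For anyone wanting to check the constant: 0xAAAAAAAB is ceil(2^33 / 3), and the mull followed by shrl $1 on the high word amounts to shifting the full 64-bit product right by 33. A minimal Python sketch of that arithmetic follows. The names MAGIC_3_32 and div3_32 are made up for illustration and are not taken from the compiler source.

```python
MAGIC_3_32 = 0xAAAAAAAB  # ceil(2**33 / 3)

def div3_32(x):
    """Model of: movl $magic,%eax / mull Input / shrl $1,%edx."""
    assert 0 <= x < 2**32
    product = x * MAGIC_3_32   # 64-bit product, as left in %edx:%eax
    edx = product >> 32        # high 32 bits of the product
    return edx >> 1            # shrl $1,%edx

# The magic number really is the rounded-up reciprocal scaled by 2**33.
assert MAGIC_3_32 == -(-2**33 // 3)
```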

As Marģers discovered, you can reduce the cycle count and minimise pipeline stalls by extending to 64-bit and applying the shift to the reciprocal, since the smaller input domain permits a larger rounding error before it starts affecting the result:


movq    $0x5555555555555556, %rax

mulq    Input ; answer is in %rdx:%rax. Since input is a 32-bit register, the upper 32 bits of its 64-bit counterpart are zero.

movl    %edx, Result
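
The 64-bit variant can be modelled the same way: 0x5555555555555556 is ceil(2^64 / 3), and reading %rdx alone is a shift of the 128-bit product by 64. This is only a sketch, assuming the input has been zero-extended from 32 bits. MAGIC_3_64 and div3_64 are hypothetical names.

```python
MAGIC_3_64 = 0x5555555555555556  # ceil(2**64 / 3)

def div3_64(x):
    """Model of: movq $magic,%rax / mulq Input / result in %edx."""
    assert 0 <= x < 2**32        # input is a zero-extended 32-bit value
    product = x * MAGIC_3_64     # 128-bit product, as left in %rdx:%rax
    return product >> 64         # %rdx; small enough to fit in %edx

assert MAGIC_3_64 == -(-2**64 // 3)
```

Working the error term through, 3 × magic = 2^64 + 2, so by this arithmetic the shortcut only starts to overshoot at an input of 2^63. That is the slack that restricting the domain to 32-bit inputs buys.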

Now, 7 (in fact, any value of the form 7 × 2^k, where k ≥ 0) is a bit different because there's no magic number and shift value that will work for the entire 32-bit range; you need an extra bit of precision. To simulate this, the input is added to the high word of the result, as if there were a 1 in the 33rd bit of the reciprocal.


movl    $0x24924925,%eax
mull    input
addl    input,%edx
rcrl    $1,%edx
shrl    $2,%edx
movl    %edx, Result

If the add operation overflows, you need to use rcrl (rotate right through carry) to cycle the carry bit back into the result as you shift right, then you can shift right normally to get the result you need.
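
To make the carry handling concrete, here is a small Python model of the sequence above. It assumes the effective reciprocal is the 33-bit value 0x124924925 = ceil(2^35 / 7), of which only the low 32 bits fit in %eax. The function name div7_32 and the variable names are illustrative only.

```python
MAGIC_7_32 = 0x24924925  # low 32 bits of ceil(2**35 / 7) = 0x124924925

def div7_32(x):
    """Model of: mull / addl input,%edx / rcrl $1,%edx / shrl $2,%edx."""
    assert 0 <= x < 2**32
    edx = (x * MAGIC_7_32) >> 32       # mull: high word of the product
    total = edx + x                    # addl input,%edx (may need 33 bits)
    carry = total >> 32                # carry flag set by the add
    edx = total & 0xFFFFFFFF           # what actually remains in %edx
    edx = (carry << 31) | (edx >> 1)   # rcrl $1,%edx pulls the carry back in
    return edx >> 2                    # shrl $2,%edx, total shift is 35

assert (1 << 32) + MAGIC_7_32 == -(-2**35 // 7)
```

With an input of 0xFFFFFFFF the add does overflow, so a plain shrl $3 in place of the rcrl/shrl pair would drop the top bit and give the wrong quotient.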

In 64-bit, this problem vanishes because you can insert that extra bit of precision as you shift the magic number - this sequence is about twice as fast, not just due to the reduced number of operations, but the reduction in pipeline stalls from each operation waiting on the previous one to complete (i.e. addl input,%edx cannot execute until the multiplication is finished, rcrl $1,%edx has to wait on addl, shrl $2,%edx has to wait on rcrl, and movl %edx, Result has to wait on shrl):


movq    $0x2492492492492493, %rax
mulq   input
movl    %edx, Result
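
The same kind of check works for the 64-bit divide by 7. In this sketch 0x2492492492492493 is taken to be ceil(2^64 / 7), and the input is assumed to be zero-extended from 32 bits. The name div7_64 is hypothetical.

```python
MAGIC_7_64 = 0x2492492492492493  # ceil(2**64 / 7)

def div7_64(x):
    """Model of: movq $magic,%rax / mulq input / result in %edx."""
    assert 0 <= x < 2**32        # input is a zero-extended 32-bit value
    product = x * MAGIC_7_64     # 128-bit product, as left in %rdx:%rax
    return product >> 64         # %rdx holds the quotient, no fix-up needed

assert MAGIC_7_64 == -(-2**64 // 7)
```

Here 7 × magic = 2^64 + 5, so the rounding error stays far too small to disturb the quotient for any 32-bit input, and the add/rcrl/shrl chain is not needed.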

Modern x86 processors do out-of-order execution and can delay "movl %edx, Result" until it's actually needed, executing any instructions after it in parallel while the multiplication completes. Unfortunately I received some test failures in tmoddiv4 and tmoddiv6 so I'm not out of the woods yet, and I'll have to see what's going on with those before I can continue.


Gareth aka. Kit

On 03/09/2021 19:56, Ched via fpc-devel wrote:

Very interesting, Gareth!
Is the div-by-7 related to 2 to the 3rd? If yes, is it possible to design a div-by-3 with similar magics?
Cheers, Ched'



On 03.09.21 at 15:35, J. Gareth Moreton via fpc-devel wrote:
Hey Marģers,
So I've been experimenting with your suggestion, and it looks like a resounding success! I added some new tests to the "bdiv" bench test to see how it performs. 16-bit multiplications don't get improved as well as they could be on x86_64 because the intermediate values are all extended to 32-bit, but you can still see a massive improvement on all the unsigned divisions by 7.
I'm currently running the test suite for i386-win32 and x86_64-win64, then will implement the same system for AArch64. Thanks so much for spotting the improvement.
Gareth aka. Kit

P.S. I still need to work on the signed modulus at some point.
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


