Div by 3 has a very elegant implementation.  In 32-bit, it's

movl    $0xAAAAAAAB,%eax
mull    Input ; answer is in %edx:%eax
shrl    $1,%edx
movl    %edx, Result

As Marģers discovered, you can reduce the cycle count and minimise pipeline stalls by extending to 64-bit and applying the shift to the reciprocal, since the smaller input domain permits a larger rounding error before it starts affecting the result:

movq    $0x5555555555555556, %rax
mulq    Input ; answer is in %rdx:%rax.  Since input is a 32-bit register, the upper 32 bits of its 64-bit counterpart are zero.
movl    %edx, Result

Now, 7 (in fact, any value of the form 7 × 2^k, where k ≥ 0) is a bit different because there's no magic number and shift value that will work for the entire 32-bit range; you need an extra bit of precision.  To simulate this, the input is added to the high word of the result, as if there were a 1 in the 33rd bit of the reciprocal.

movl    $0x24924925,%eax
mull    input
addl    input,%edx
rcrl    $1,%edx
shrl    $2,%edx
movl    %edx, Result

If the add operation overflows, you need to use rcrl (rotate with carry) to cycle the carry bit back into the result as you shift right, then you can shift right normally to get the result you need.

In 64-bit, this problem vanishes because you can insert that extra bit of precision as you shift the magic number - this sequence is about twice as fast, not just due to the reduced number of operations, but the reduction in pipeline stalls from each operation waiting on the previous one to complete (i.e. addl input,%edx cannot execute until the multiplication is finished, rcrl $1,%edx has to wait on addl, shrl $2,%edx has to wait on rcrl, and movl %edx, Result has to wait on shrl):

movq    0x2492492492492493
mulq   input
movl    %edx, Result

Modern x86 processors do out-of-order execution and can delay "movl %edx, Result" until it's actually needed, executing any instructions after it in parallel while the multiplication completes.  Unfortunately I received some test failures in tmoddiv4 and tmoddiv6 so I'm not out of the woods yet, and I'll have to see what's going on with those before I can continue.

Gareth aka. Kit

On 03/09/2021 19:56, Ched via fpc-devel wrote:
Very interesting, Gareth !

Is the div-by-7 related to 2 to the 3rd ? If yes, is it possible to design a div-by-3 with similar magics ?

Cheers, Ched'



Le 03.09.21 à 15:35, J. Gareth Moreton via fpc-devel a écrit :
Hey Marģers,

So I've been experimenting with your suggestion, and it looks like a resounding success!  I added some new tests to the "bdiv" bench test to see how it performs.  16-bit multiplications don't get improved as well as they could be on x86_64 because the intermediate values are all extended to 32-bit, but you can still see a massive improvement on all the unsigned divisions by 7.

I'm currently running the test suite for i386-win32 and x86_64-win64, then will implement the same system for AArch64. Thanks so much for spotting the improvement.

Gareth aka. Kit

P.S. I still need to work on the signed modulus at some point.


_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to