]Mark Dickinson <dicki...@gmail.com>]
>> Division may still be problematic.

Heh. I'm half convinced that heavy duty bigint packages are so often
written in assembler because their authors are driven insane by trying
to trick C compilers into generating "the obvious" machine
instructions needed.

An alternative to HW division is to multiply by a scaled integer
reciprocal instead, but it's very delicate to get all cases right. GMP
heavyweights Niels Moller and Torbjorn Granlund wrote up the most
efficient I've seen of this ilk in their paper "Improved division by
invariant integers".

It requires a single x single -> double multiply, and another but only
the low bits from that one, to get the right quotient and remainder.

Ironically, it would run faster if CPython used all 32 bits of its
internal digits - some operations in their algorithm are expected to
overflow at times (including + and -!), andi it's crucial that they be
done modulo the base in use. That happens "by magic" if the base
matches the unsigned type's width.

They only need a single-width scaled reciprocal, which can be computed
with a HW double / single -> single divide if available, or via
excruciating division-free scaled integer Newton refinement. Of course
the cost of computing the reciprocal once pays off each time the same
denominator is used.

Curiously,

"""
It is curious that on these processors, the combination of our
reciprocal algorithm (Alg. 2) and division algorithm (Alg. 4) is
significantly faster than the built in assembler instruction
for 2/1 division.
"""

HW division may have gotten faster since then, though.

> On that note: Python divisions are somewhat crippled even on x64. Assuming
> 30-bit digits, the basic building block that's needed for multi-precision 
> division
> is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient 
> (and
> ideally also a 32-bit remainder).

Which is an instance of what they mean by "2/1 division" above.

> ...
> IOW, forcing use of DIVL instead of DIVQ, in combination with getting the 
> remainder
> directly from the DIV instruction instead of computing it separately, gives a 
> 41%
> speedup in this particular worst case. I'd expect the effect to be even more 
> marked
> on x86, but haven't yet done those timings.

> ...
> I don't know whether we really want to open the door to using inline assembly
> for performance reasons in longobject.c, but it's interesting to see the 
> effect.

Indeed it is! But changing the problem to use multiplication instead
may be both faster and more portable, albeit at the cost of subtler
code.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/QGCHH6B7HCQRARPPVELVROO6FMWQY3G4/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to