On Sun, Jan 2, 2022 at 10:35 AM Mark Dickinson <dicki...@gmail.com> wrote:

> Division may still be problematic.
>

On that note: Python divisions are somewhat crippled even on x64. Assuming
30-bit digits, the basic building block that's needed for multi-precision
division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit
quotient (and ideally also a 32-bit remainder). And there's an x86/x64
instruction that does exactly that, namely DIVL. But without using inline
assembly, current versions of GCC and Clang apparently can't be persuaded
to emit that instruction from the longobject.c source - they'll use DIVQ (a
128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set
to zero) on x64, and the __udivti3 or __udivti4 intrinsic on x86.

I was curious to find out what the potential impact of the failure to use
DIVL was, so I ran some timings. A worst-case target is division of a large
(multi-digit) integer by a single-digit integer (where "digit" means digit
in the sense of PyLong digit, not decimal digit), since that involves
multiple CPU division instructions in a fairly tight loop.

Results: on my laptop (2.7 GHz Intel Core i7-8559U, macOS 10.14.6,
non-optimised non-debug Python build), a single division of 10**1000 by 10
takes ~1018ns on the current main branch and ~722ns when forced to use the
DIVL instruction (by inserting inline assembly into the inplace_divrem1
function). IOW, forcing use of DIVL instead of DIVQ, in combination
with getting the remainder directly from the DIV instruction instead of
computing it separately, gives a 41% speedup in this particular worst case.
I'd expect the effect to be even more marked on x86, but haven't yet done
those timings.

For anyone who wants to play along, here's the implementation of the
inplace_divrem1 (in longobject.c) that I was using:

static digit
inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
{
digit remainder = 0;

assert(n > 0 && n <= PyLong_MASK);
while (--size >= 0) {
twodigits dividend = ((twodigits)remainder << PyLong_SHIFT) | pin[size];
digit quotient, high, low;
high = (digit)(dividend >> 32);
low = (digit)dividend;
__asm__("divl %2\n"
: "=a" (quotient), "=d" (remainder)
: "r" (n), "a" (low), "d" (high)
);
pout[size] = quotient;
}
return remainder;
}


I don't know whether we *really* want to open the door to using inline
assembly for performance reasons in longobject.c, but it's interesting to
see the effect.

-- 
Mark
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ZWGPO3TMCI7WNLC3EMS26DIKI5D3ZWMK/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to