Mark Dickinson <dicki...@gmail.com> added the comment:

It finally occurred to me that what might be killing 32-bit performance 
is the divisions, rather than the multiplications.

To test this, here's a version of 30bit_longdigit17.patch that replaces 
just *two* of the divisions in Objects/longobject.c with the appropriate 
x86 divl assembler instruction.  The result for pydigits is an 
astonishing 10% speedup!

Results of running python pydigits_bestof.py 2000 on 32-bit OS X 
10.5.6/Core 2 Duo:

unpatched py3k
--------------
Best Time; 2212.6 ms

30bit_longdigit17.patch
-----------------------
Best Time; 2283.9 ms (-3.1% relative to py3k)

30bit_longdigit17+asm.patch
---------------------------
Best Time; 2085.7 ms (+6.1% relative to py3k)

The problem is that (e.g. in the main loop of x_divrem) we're doing a 
64-bit by 32-bit division, expecting a 32-bit quotient and a 32-bit 
remainder.  From the analysis of the algorithm, *we* know that the 
quotient will always fit into 32 bits, so on x86, for example, a divl 
instruction is appropriate.  But unless the compiler is extraordinarily 
clever it doesn't know this, so it emits an expensive call to a library 
function that computes the full 64-bit quotient, probably involving 
multiple divisions and/or some branches.
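A minimal sketch of the idea (the helper name and exact asm constraints
are mine, not taken from the patch): on 32-bit x86 a single divl does
the 64-by-32 division directly, provided the caller guarantees the
quotient fits in 32 bits; on other targets we fall back to the portable
64-bit division the compiler would have emitted anyway.

```c
#include <assert.h>
#include <stdint.h>

/* Divide the 64-bit value n by the 32-bit value d, storing the
   remainder in *rem and returning the quotient.  Precondition: the
   quotient must fit in 32 bits (divl raises a divide error
   otherwise). */
static uint32_t
div64by32(uint64_t n, uint32_t d, uint32_t *rem)
{
#if defined(__GNUC__) && defined(__i386__)
    uint32_t q, r;
    /* divl divides edx:eax by its operand, leaving the quotient in
       eax and the remainder in edx. */
    __asm__("divl %4"
            : "=a" (q), "=d" (r)
            : "a" ((uint32_t)n), "d" ((uint32_t)(n >> 32)), "rm" (d));
    *rem = r;
    return q;
#else
    /* Portable fallback: on a 32-bit target this typically compiles
       to a call into a 64-bit division routine such as __udivdi3 --
       exactly the cost the asm version avoids. */
    *rem = (uint32_t)(n % d);
    return (uint32_t)(n / d);
#endif
}
```

The precondition is what makes this safe in x_divrem: the algorithm's
own invariants bound the quotient, information the compiler can't see.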

On 32-bit PowerPC things are even worse, since there isn't even a 
64-by-32-bit divide instruction; only a 32-bit by 32-bit division.

So I could still be persuaded that 30-bit digits should only be enabled 
by default on 64-bit machines...

Added file: http://bugs.python.org/file13149/30bit_longdigit17+asm.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4258>
_______________________________________