Tim Peters <[email protected]> added the comment:
I was thinking about
comb(1000000, 500000)
The simple "* --n / ++k" loop does 499,999 each of multiplication and division,
and in all instances the second operand is a single Python digit. Cheap as can
be.
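A minimal sketch of that simple loop (my paraphrase of the idea, not the exact code that was timed):

```python
def comb_simple(n, k):
    """C(n, k) via the simple "* --n / ++k" loop.

    After step j the partial product equals C(original_n, j), so the
    division by j is always exact, and both second operands (the
    decreasing n and the increasing j) stay small.
    """
    assert 0 <= k <= n
    k = min(k, n - k)
    result = 1
    for j in range(1, k + 1):
        result = result * n // j  # exact: result is C(n0, j) afterward
        n -= 1
    return result
```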
In contrast, even though it short-circuits all "small k" cases, comb_pole2.py
executes
return C(n, j) * C(n-j, k-j) // C(k, j) # Recursive case
7,299,598 times. Far more top-level arithmetic operations, including
extra-expensive long division with both operands multi-digit. At the very top
of the tree, "// C(500000, 250000)" is dividing out nearly half a million bits.
But a tiny Python function coding the simple loop takes about 150 seconds,
while Raymond's C() takes about 30 (under the released 3.10.1). The benefits
from provoking Karatsuba are major.
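For concreteness, here's a sketch of the kind of recursion that "return" line comes from. The split point j = k // 2 and the small-k cutoff here are my guesses for illustration, not necessarily the choices in comb_pole2.py:

```python
from functools import cache

@cache
def C(n, k):
    # Uses the identity C(n, k) == C(n, j) * C(n-j, k-j) // C(k, j),
    # valid for any 0 <= j <= k. Splitting near j = k // 2 keeps the
    # factors balanced, so the big multiplications can hit Karatsuba.
    k = min(k, n - k)
    if k < 10:  # hypothetical "small k" cutoff
        result = 1
        for j in range(1, k + 1):
            result = result * n // j
            n -= 1
        return result
    j = k // 2
    return C(n, j) * C(n - j, k - j) // C(k, j)  # Recursive case
```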
Just for the heck of it, I coded a complication of the simple loop that just
tries to provoke Karatsuba on the numerator (prod(range(n, n-k, -1))), and
dividing that just once at the end, by factorial(k). That dropped the time to
about 17.5 seconds. Over 80% of that time is spent computing the "return"
expression, where len(p) is 40 and only 7 entries pass the x>1 test:
return prod(x for x in p if x > 1) // factorial(k)
That is, with really big results, almost everything else is noise compared to
the time needed for the relative handful of * and // operations on the very
largest intermediate results.
Stefan's code runs much faster still, but requires storage proportional to k.
The crude hack I used needs a fixed and small amount of temp storage
(independent of k and n), basically grouping inputs into buckets roughly based
on the log _of_ their log, and keeping only one int per bucket.
Here's the code:
def pcomb2(n, k):
    from math import prod, factorial

    assert 0 <= k <= n
    k = min(k, n-k)

    PLEN = 40 # good enough for ints with over a trillion bits
    p = [1] * PLEN

    def fold_into_p(x):
        if x == 1:
            return
        while True:
            i = max(x.bit_length().bit_length() - 5, 0)
            if p[i] == 1:
                p[i] = x
                break
            x *= p[i]
            p[i] = 1

    def showp():
        for i in range(PLEN):
            pi = p[i]
            if pi > 1:
                print(i, pi.bit_length())

    for i in range(k):
        fold_into_p(n)
        n -= 1
    showp()
    return prod(x for x in p if x > 1) // factorial(k)
I'm not sure it's practical. For example, while the list p[] can be kept quite
small, factorial(k) can require a substantial amount of temp storage of its own
- factorial(500000) in the case at hand is an int approaching 9 million bits.
Note: again in the case at hand, computing comb(1000000, 500000) via
factorial(1000000) // factorial(500000)**2
takes about 10% longer than Raymond's C() function.
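At a size cheap enough to verify directly, that's just the central-binomial identity:

```python
from math import comb, factorial

# C(2m, m) == (2m)! // (m!)**2 -- same formula as above, smaller m.
m = 500
assert factorial(2 * m) // factorial(m) ** 2 == comb(2 * m, m)
```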
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37295>
_______________________________________