Tim Peters <t...@python.org> added the comment:

I was thinking about

comb(1000000, 500000)

The simple "* --n / ++k" loop does 499,999 each of multiplication and division, 
and in all instances the second operand is a single Python digit. Cheap as can 
be.
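Sketched in Python (my rendering, not code from the report), that simple loop looks like this; every multiply and floor-divide has a single-digit second operand, so each step is cheap, but nothing ever gets big enough on both sides to trigger Karatsuba:

```python
def comb_simple(n, k):
    # Compute C(n, k) with k multiplies and k divides, each with a
    # one-digit second operand.
    k = min(k, n - k)
    result = 1
    for j in range(1, k + 1):
        result = result * n // j   # exact: result * n is divisible by j
        n -= 1
    return result
```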

In contrast, even though it short-circuits all "small k" cases, comb_pole2.py 
executes

    return C(n, j) * C(n-j, k-j) // C(k, j)               # Recursive case

7,299,598 times. Far more top-level arithmetic operations, including 
extra-expensive long division with both operands multi-digit. At the very top 
of the tree, "// C(500000, 250000)" is dividing out nearly half a million bits.
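For reference, here is a hedged sketch of the divide-and-conquer shape (not Raymond's actual comb_pole2.py, which has more tuning). It rests on the identity C(n, k) == C(n, j) * C(n-j, k-j) // C(k, j), splitting near j = k//2 so the factors are comparably sized and CPython's Karatsuba multiplication can kick in:

```python
from math import comb

def C(n, k):
    # Sketch of the recursive scheme; cutoff value 10 is illustrative.
    k = min(k, n - k)
    if k < 10:                      # short-circuit the "small k" cases
        return comb(n, k)
    j = k // 2
    return C(n, j) * C(n-j, k-j) // C(k, j)               # Recursive case
```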

But a tiny Python function coding the simple loop takes about 150 seconds, 
while Raymond's C() takes about 30 (under the released 3.10.1). The benefits 
from provoking Karatsuba are major.

Just for the heck of it, I coded a complication of the simple loop that tries 
to provoke Karatsuba on the numerator (prod(range(n, n-k, -1))), dividing just 
once at the end, by factorial(k). That dropped the time to about 17.5 seconds. 
Over 80% of that time is spent computing the "return" expression, where len(p) 
is 40 and only 7 entries pass the x > 1 test:

    return prod(x for x in p if x > 1) // factorial(k)

That is, with really big results, almost everything else is noise compared to 
the time needed to do the relative handful of * and // on the very largest 
intermediate results.
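The underlying formula, rendered directly (a sketch of the idea only; note that math.prod accumulates left to right, so by itself it keeps multiplying one huge operand by one small one and does NOT balance operand sizes; the bucketing trick in the code below is what actually provokes Karatsuba):

```python
from math import prod, factorial

def comb_numerator(n, k):
    # C(n, k) = n * (n-1) * ... * (n-k+1) / k!, with one division at the end.
    k = min(k, n - k)
    return prod(range(n, n - k, -1)) // factorial(k)
```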

Stefan's code runs much faster still, but requires storage proportional to k. 
The crude hack I used needs a fixed and small amount of temp storage 
(independent of k and n), basically grouping inputs into buckets roughly based 
on the log _of_ their log, and keeping only one int per bucket.

Here's the code:

    def pcomb2(n, k):
        from math import prod, factorial

        assert 0 <= k <= n
        k = min(k, n-k)
        PLEN = 40 # good enough for ints with over a trillion bits
        p = [1] * PLEN

        def fold_into_p(x):
            # Bucket index is based on the log of the log of x, so each
            # bucket holds a partial product in a fixed bit-length range.
            if x == 1:
                return
            while True:
                i = max(x.bit_length().bit_length() - 5, 0)
                if p[i] == 1:
                    p[i] = x
                    break
                # Bucket occupied: merge with the resident (two ints of
                # similar size, so Karatsuba applies) and retry with the
                # product, which lands in a higher bucket.
                x *= p[i]
                p[i] = 1

        def showp():
            # Debug output: which buckets are occupied, and how wide.
            for i in range(PLEN):
                pi = p[i]
                if pi > 1:
                    print(i, pi.bit_length())

        for i in range(k):
            fold_into_p(n)
            n -= 1
        showp()
        return prod(x for x in p if x > 1) // factorial(k)

I'm not sure it's practical. For example, while the list p[] can be kept quite 
small, factorial(k) can require a substantial amount of temp storage of its 
own: factorial(500000) in the case at hand is an int approaching 9 million bits.

Note: again in the case at hand, computing it via

    factorial(1000000) // factorial(500000)**2

takes about 10% longer than Raymond's C() function.
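For completeness, the factorial form in general (a sketch; the timing above used the n == 2*k special case, factorial(1000000) // factorial(500000)**2):

```python
from math import factorial

def comb_fact(n, k):
    # C(n, k) = n! / (k! * (n-k)!), computed with one big division.
    return factorial(n) // (factorial(k) * factorial(n - k))
```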

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37295>
_______________________________________