Tim Peters <t...@python.org> added the comment:

Another trick, building on the last one: computing factorial(k) isn't cheap, in 
time or space, and neither is dividing by it. But we know it will entirely 
cancel out. Indeed, for each outer loop iteration, prod(p) is divisible by the 
current k. But, unlike Stefan's code, which materializes range(n, n-k, -1) as 
an explicit list, we have no way to calculate "in advance" which elements of 
p[] are divisible by what.

What we _can_ do is march over all of p[], and take the gcd of each element 
with the current k. If that gcd is greater than 1, it can be divided out of 
both that element of p[] and the current k. Lather, rinse, repeat: the current 
k must eventually be driven to 1.
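
In code, one outer iteration of that idea looks roughly like this (a sketch 
only - the standalone helper and its name are purely illustrative, not part 
of any patch):

    from math import gcd

    def divide_k_out_of_p(p, k):
        # March over the partial products, cancelling factors of k via gcd.
        # Because prod(p) is divisible by k, one full pass drives k to 1.
        for i, pi in enumerate(p):
            if k == 1:
                break
            if pi > 1:
                g = gcd(pi, k)
                if g > 1:
                    p[i] = pi // g
                    k //= g
        assert k == 1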

But that slows things down: gcd() is also expensive.

But there's a standard trick to speed that too: as in serious implementations 
of Pollard's rho factorization method, "chunk it". That is, don't do it on 
every outer loop iteration, but instead accumulate the running product of 
several denominators first, then do the expensive gcd pass on that product.

Here's a replacement for "the main loop" of the last code that delays doing 
gcds until the running product is at least 2000 bits:

        fold_into_p(n)

        kk = 1
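        # Note: range(2, k+1) below is evaluated just once, before `k` is
        # rebound as the loop variable, so reusing the name doesn't change
        # the loop's bound.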
        for k in range(2, k+1):
            n -= 1
            # Merge into p[].
            fold_into_p(n)
            # Divide by k.
            kk *= k
            if kk.bit_length() < 2000:
                continue
            for i, pi in enumerate(p):
                if pi > 1:
                    g = gcd(pi, kk)
                    if g > 1:
                        p[i] = pi // g
                        kk //= g
                        if kk == 1:
                            break
            assert kk == 1
        showp()
        return prod(x for x in p if x > 1) // kk

That runs in under half the time (for n=1000000, k=500000), down to under 7.5 
seconds. And, of course, the largest denominator consumes only about 2000 bits 
instead of 500000!'s 8,744,448 bits.
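
That largest-denominator figure is easy to check directly:

    from math import factorial
    print(factorial(500000).bit_length())  # 8744448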

Raising the kk bit limit from 2000 to 10000 cuts another 2.5 seconds off, down 
to about 5 seconds.

Much above that, it starts getting slower again.

Seems too hard to out-think! And it's highly dubious to fine-tune it based on 
a single input case ;-)

Curious: at a cutoff of 10000 bits, we're beyond the point where Karatsuba 
would have paid off for computing denominator partial products too.
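
The replacement loop above leans on the p[] accumulator, fold_into_p(), and 
showp() from the last message, which aren't repeated here. For anyone reading 
this message in isolation, here's a self-contained sketch of the whole scheme; 
the helper body is only a plausible reconstruction from the description, not 
the actual code from the earlier message, and showp() (a debugging aid) is 
omitted:

    from math import gcd, prod

    def comb_sketch(n, k):
        assert 0 <= k <= n
        k = min(k, n - k)
        if k == 0:
            return 1
        # p[] holds partial products, carried upward like a binary counter,
        # so that multiplications stay roughly balanced in size.
        p = [1] * 64

        def fold_into_p(x):
            i = 0
            while p[i] != 1:
                x *= p[i]
                p[i] = 1
                i += 1
            p[i] = x

        fold_into_p(n)
        kk = 1
        for j in range(2, k + 1):
            n -= 1
            fold_into_p(n)      # merge the next numerator factor
            kk *= j             # accumulate denominator factors
            if kk.bit_length() < 2000:
                continue
            # Chunked gcd pass: cancel the accumulated denominators out of p[].
            for i, pi in enumerate(p):
                if pi > 1:
                    g = gcd(pi, kk)
                    if g > 1:
                        p[i] = pi // g
                        kk //= g
                        if kk == 1:
                            break
            assert kk == 1
        return prod(x for x in p if x > 1) // kk

A quick sanity check against math.comb:

    from math import comb
    assert all(comb_sketch(n, k) == comb(n, k)
               for n in range(60) for k in range(n + 1))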

----------
versions:  -Python 3.11

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37295>
_______________________________________