# Re: [HACKERS] multivariate statistics (v19)

```On Wed, Feb 08, 2017 at 03:23:25PM +0000, Dean Rasheed wrote:
> On 6 February 2017 at 21:26, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> > Tomas Vondra wrote:
> >> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> >
> >> > Nearby, some auxiliary functions such as n_choose_k and
> >> > num_combinations are not documented. What it is that they do? I'd
> >> > move these at the end of the file, keeping the important entry points
> >> > at the top of the file.
> >>
> >> I'd say n-choose-k is pretty widely known term from combinatorics. The
> >> comment would essentially say just 'this is n-choose-k' which seems rather
> >> pointless. So as much as I dislike the self-documenting code, this actually
> >> seems like a good case of that.
> >
> > Actually, we do have such comments all over the place.  I knew this as
> > "n sobre k", so the english name doesn't immediately ring a bell with me
> > until I look it up; I think the function comment could just say
> > "n_choose_k -- this function returns the binomial coefficient".
>
> One of the things you have to watch out for when writing code to
> compute binomial coefficients is integer overflow, since the numerator
> and denominator get large very quickly. For example, the current code
> will overflow for n=13, k=12, which really isn't that large.
>
> This can be avoided by computing the product in reverse and using a
> larger datatype like a 64-bit integer to store a single intermediate
> result. The point about multiplying the terms in reverse is that it
> guarantees that each intermediate result is an exact integer (a
> smaller binomial coefficient), so there is no need to track separate
> numerators and denominators, and you avoid huge intermediate
> factorials. Here's what that looks like in psuedo-code:
>
> binomial(int n, int k):
>     # Save computational effort by using the symmetry of the binomial
>     # coefficients
>     k = min(k, n-k);
>
>     # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
>     # starting from binomial(n-k, 0) = 1, and computing the sequence
>     # binomial(n-k+1, 1), binomial(n-k+2, 2), ...
>     #
>     # Note that each intermediate result is an exact integer.
>     int64 result = 1;
>     for (int i = 1; i <= k; i++)
>     {
>         result = (result * (n-k+i)) / i;
>         if (result > INT_MAX) Raise overflow error
>     }
>     return (int) result;
>
>
> Note also that I think num_combinations(n) is just an expensive way of
> calculating 2^n - n - 1.```
```
Combinations are n!/(k! * (n-k)!), so computing those is more
along the lines of:

unsigned long long
choose(unsigned long long n, unsigned long long k) {
if (k > n) {
return 0;
}
unsigned long long r = 1;
for (unsigned long long d = 1; d <= k; ++d) {
r *= n--;
r /= d;
}
return r;
}

which greatly reduces the chance of overflow.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!