Re: [Numpy-discussion] Multivariate hypergeometric distribution?
On Mon, Jul 2, 2012 at 4:16 PM, Fernando Perez fperez@gmail.com wrote: Hi all, in recent work with a colleague, the need came up for a multivariate hypergeometric sampler; I had a look in the numpy code and saw we have the bivariate version, but not the multivariate one. I had a look at the code in scipy.stats.distributions, and it doesn't look too difficult to add a proper multivariate hypergeometric by extending the bivariate code, with one important caveat: the hard part is the implementation of the actual discrete hypergeometric sampler, which lives inside of numpy/random/mtrand/distributions.c: https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c#L743 That code is hand-written C, and it only works for the bivariate case right now. It doesn't look terribly difficult to extend, but it will certainly take a bit of care and testing to ensure all edge cases are handled correctly. My only foray into this http://projects.scipy.org/numpy/ticket/921 http://projects.scipy.org/numpy/ticket/923 This looks difficult to add without a good reference and clear description of the algorithm. Does anyone happen to have that implemented lying around, in a form that would be easy to merge to add this capability to numpy? not me, I have never even heard of multivariate hypergeometric distribution. maybe http://hal.inria.fr/docs/00/11/00/56/PDF/perm.pdf p.11 with some properties http://www.math.uah.edu/stat/urn/MultiHypergeometric.html I've seen one other algorithm, that seems to need N (number of draws in hypergeom) random variables for one multivariate hypergeometric random draw, which seems slow to me. But maybe someone has it lying around. Josef Thanks, f ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Multivariate hypergeometric distribution?
On Mon, Jul 2, 2012 at 8:08 PM, josef.p...@gmail.com wrote: On Mon, Jul 2, 2012 at 4:16 PM, Fernando Perez fperez@gmail.com wrote: Hi all, in recent work with a colleague, the need came up for a multivariate hypergeometric sampler; I had a look in the numpy code and saw we have the bivariate version, but not the multivariate one. I had a look at the code in scipy.stats.distributions, and it doesn't look too difficult to add a proper multivariate hypergeometric by extending the bivariate code, with one important caveat: the hard part is the implementation of the actual discrete hypergeometric sampler, which lives inside of numpy/random/mtrand/distributions.c: https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c#L743 That code is hand-written C, and it only works for the bivariate case right now. It doesn't look terribly difficult to extend, but it will certainly take a bit of care and testing to ensure all edge cases are handled correctly. My only foray into this http://projects.scipy.org/numpy/ticket/921 http://projects.scipy.org/numpy/ticket/923 This looks difficult to add without a good reference and clear description of the algorithm. Does anyone happen to have that implemented lying around, in a form that would be easy to merge to add this capability to numpy? not me, I have never even heard of multivariate hypergeometric distribution. maybe http://hal.inria.fr/docs/00/11/00/56/PDF/perm.pdf p.11 with some properties http://www.math.uah.edu/stat/urn/MultiHypergeometric.html I've seen one other algorithm, that seems to need N (number of draws in hypergeom) random variables for one multivariate hypergeometric random draw, which seems slow to me. But maybe someone has it lying around. Now I have a pure num/sci/python version around. A bit more than an hour, so no guarantees, but freq and pmf look close enough. Josef Josef Thanks, f ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion # -*- coding: utf-8 -*- Created on Mon Jul 02 20:23:08 2012 Author: Josef Perktold import numpy as np n_balls = [5, 3, 4] n_sample = 5 size = 10 p = len(n_balls) n_all = sum(n_balls) rvs = np.zeros((size, p), int) n_bad = n_all n_remain = n_sample * np.ones(size, int) for ii in range(p-1): n_good = n_balls[ii] n_bad = n_bad - n_good rvs_ii = rvs[:,ii] mask = n_remain = 1 need = mask.sum() rvs_ii[mask] = np.random.hypergeometric(n_good, n_bad, n_remain[mask], size=need) rvs[:,ii] = rvs_ii n_remain = np.maximum(n_remain - rvs_ii, 0) rvs[:, -1] = n_sample - rvs.sum(1) #print rvs print rvs.mean(0) * n_all / n_sample u, idx = np.unique(rvs.view([('', int)]*3), return_inverse=True) u_arr = u.view(int).reshape(len(u), -1) count = np.bincount(idx) freq = count * 1. / len(idx) from scipy.misc import comb def pmf(x, n_balls): x = np.asarray(x) n_balls = np.asarray(n_balls) #p = len(n_balls) ret = np.product(comb(n_balls, x)) / comb(n_balls.sum(), x.sum()) return ret print print freq for x,fr in zip(u_arr, freq): th = pmf(x, n_balls) print x, np.round(th, 5), fr, np.round(fr - th, 10) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Multivariate hypergeometric distribution?
On Mon, Jul 2, 2012 at 9:35 PM, josef.p...@gmail.com wrote: On Mon, Jul 2, 2012 at 8:08 PM, josef.p...@gmail.com wrote: On Mon, Jul 2, 2012 at 4:16 PM, Fernando Perez fperez@gmail.com wrote: Hi all, in recent work with a colleague, the need came up for a multivariate hypergeometric sampler; I had a look in the numpy code and saw we have the bivariate version, but not the multivariate one. I had a look at the code in scipy.stats.distributions, and it doesn't look too difficult to add a proper multivariate hypergeometric by extending the bivariate code, with one important caveat: the hard part is the implementation of the actual discrete hypergeometric sampler, which lives inside of numpy/random/mtrand/distributions.c: https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c#L743 That code is hand-written C, and it only works for the bivariate case right now. It doesn't look terribly difficult to extend, but it will certainly take a bit of care and testing to ensure all edge cases are handled correctly. My only foray into this http://projects.scipy.org/numpy/ticket/921 http://projects.scipy.org/numpy/ticket/923 This looks difficult to add without a good reference and clear description of the algorithm. Does anyone happen to have that implemented lying around, in a form that would be easy to merge to add this capability to numpy? not me, I have never even heard of multivariate hypergeometric distribution. maybe http://hal.inria.fr/docs/00/11/00/56/PDF/perm.pdf p.11 with some properties http://www.math.uah.edu/stat/urn/MultiHypergeometric.html I've seen one other algorithm, that seems to need N (number of draws in hypergeom) random variables for one multivariate hypergeometric random draw, which seems slow to me. But maybe someone has it lying around. Now I have a pure num/sci/python version around. A bit more than an hour, so no guarantees, but freq and pmf look close enough. I could be wrong, but I think PyMC has sampling and likelihood. Skipper ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Multivariate hypergeometric distribution?
On Mon, Jul 2, 2012 at 7:31 PM, Skipper Seabold jsseab...@gmail.com wrote: I could be wrong, but I think PyMC has sampling and likelihood. It appears you're right! http://pymc-devs.github.com/pymc/distributions.html?highlight=hypergeometric#pymc.distributions.multivariate_hypergeometric_like Thanks :) Cheers, f ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Multivariate hypergeometric distribution?
On Mon, Jul 2, 2012 at 7:49 PM, Fernando Perez fperez@gmail.com wrote: It appears you're right! http://pymc-devs.github.com/pymc/distributions.html?highlight=hypergeometric#pymc.distributions.multivariate_hypergeometric_like Furthermore, the code actually calls a sampler implemented in Fortran: http://pymc-devs.github.com/pymc/_modules/pymc/distributions.html#multivariate_hypergeometric_like which calls this: https://github.com/pymc-devs/pymc/blob/master/pymc/flib.f#L4379 Thanks again to both of you for the help. It's always productive to ask around here ;) Cheers, f ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Multivariate hypergeometric distribution?
On Mon, Jul 2, 2012 at 10:53 PM, Fernando Perez fperez@gmail.com wrote: On Mon, Jul 2, 2012 at 7:49 PM, Fernando Perez fperez@gmail.com wrote: It appears you're right! nice idea: https://github.com/pymc-devs/pymc/blob/master/pymc/distributions.py#L1670 http://pymc-devs.github.com/pymc/distributions.html?highlight=hypergeometric#pymc.distributions.multivariate_hypergeometric_like Furthermore, the code actually calls a sampler implemented in Fortran: http://pymc-devs.github.com/pymc/_modules/pymc/distributions.html#multivariate_hypergeometric_like which calls this: https://github.com/pymc-devs/pymc/blob/master/pymc/flib.f#L4379 Thanks again to both of you for the help. It's always productive to ask around here ;) the loglikelihood function is just some calls to scipy.special.gammaln I changed my version to get better numerical precision #taken from scipy.misc but returns log of comb def log_comb(n, k): from scipy import special from scipy.special import gammaln k = np.asarray(k) n = np.asarray(n) cond = (k = n) (n = 0) (k = 0) sv = special.errprint(0) vals = gammaln(n + 1) - gammaln(n - k + 1) - gammaln(k + 1) sv = special.errprint(sv) return np.where(cond, vals, -np.inf) def log_pmf(x, n_balls): x = np.asarray(x) n_balls = np.asarray(n_balls) ret = np.sum(log_comb(n_balls, x)) - log_comb(n_balls.sum(), x.sum()) return ret def pmf(x, n_balls): x = np.asarray(x) n_balls = np.asarray(n_balls) ret = np.sum(log_comb(n_balls, x)) - log_comb(n_balls.sum(), x.sum()) ret = np.exp(ret) return ret proof by picture https://picasaweb.google.com/106983885143680349926/Joepy#5760777856741667746 https://picasaweb.google.com/106983885143680349926/Joepy#5760777861891130962 Cheers, Josef Cheers, f ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion