Re: [Numpy-discussion] Multivariate hypergeometric distribution?

2012-07-02 Thread josef . pktd
On Mon, Jul 2, 2012 at 4:16 PM, Fernando Perez fperez@gmail.com wrote:
 Hi all,

 in recent work with a colleague, the need came up for a multivariate
 hypergeometric sampler; I had a look in the numpy code and saw we have
 the bivariate version, but not the multivariate one.

 I had a look at the code in scipy.stats.distributions, and it doesn't
 look too difficult to add a proper multivariate hypergeometric by
 extending the bivariate code, with one important caveat: the hard part
 is the implementation of the actual discrete hypergeometric sampler,
 which lives inside of numpy/random/mtrand/distributions.c:

 https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c#L743

 That code is hand-written C, and it only works for the bivariate case
 right now.  It doesn't look terribly difficult to extend, but it will
 certainly take a bit of care and testing to ensure all edge cases are
 handled correctly.

My only foray into this

http://projects.scipy.org/numpy/ticket/921
http://projects.scipy.org/numpy/ticket/923

This looks difficult to add without a good reference and clear
description of the algorithm.


 Does anyone happen to have that implemented lying around, in a form
 that would be easy to merge to add this capability to numpy?

not me, I have never even heard of multivariate hypergeometric distribution.


maybe http://hal.inria.fr/docs/00/11/00/56/PDF/perm.pdf  p.11
with some properties http://www.math.uah.edu/stat/urn/MultiHypergeometric.html

I've seen one other algorithm, that seems to need N (number of draws
in hypergeom) random variables for one multivariate hypergeometric
random draw, which seems slow to me.

But maybe someone has it lying around.

Josef


 Thanks,

 f
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multivariate hypergeometric distribution?

2012-07-02 Thread josef . pktd
On Mon, Jul 2, 2012 at 8:08 PM,  josef.p...@gmail.com wrote:
 On Mon, Jul 2, 2012 at 4:16 PM, Fernando Perez fperez@gmail.com wrote:
 Hi all,

 in recent work with a colleague, the need came up for a multivariate
 hypergeometric sampler; I had a look in the numpy code and saw we have
 the bivariate version, but not the multivariate one.

 I had a look at the code in scipy.stats.distributions, and it doesn't
 look too difficult to add a proper multivariate hypergeometric by
 extending the bivariate code, with one important caveat: the hard part
 is the implementation of the actual discrete hypergeometric sampler,
 which lives inside of numpy/random/mtrand/distributions.c:

 https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c#L743

 That code is hand-written C, and it only works for the bivariate case
 right now.  It doesn't look terribly difficult to extend, but it will
 certainly take a bit of care and testing to ensure all edge cases are
 handled correctly.

 My only foray into this

 http://projects.scipy.org/numpy/ticket/921
 http://projects.scipy.org/numpy/ticket/923

 This looks difficult to add without a good reference and clear
 description of the algorithm.


 Does anyone happen to have that implemented lying around, in a form
 that would be easy to merge to add this capability to numpy?

 not me, I have never even heard of multivariate hypergeometric distribution.


 maybe http://hal.inria.fr/docs/00/11/00/56/PDF/perm.pdf  p.11
 with some properties http://www.math.uah.edu/stat/urn/MultiHypergeometric.html

 I've seen one other algorithm, that seems to need N (number of draws
 in hypergeom) random variables for one multivariate hypergeometric
 random draw, which seems slow to me.

 But maybe someone has it lying around.

Now I have a pure num/sci/python version around.

A bit more than an hour, so no guarantees, but freq and pmf look close enough.

Josef


 Josef


 Thanks,

 f
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
# -*- coding: utf-8 -*-


Created on Mon Jul 02 20:23:08 2012

Author: Josef Perktold


import numpy as np

n_balls = [5, 3, 4]
n_sample = 5
size = 10


p = len(n_balls)
n_all = sum(n_balls)
rvs = np.zeros((size, p), int)

n_bad = n_all
n_remain = n_sample * np.ones(size, int)
for ii in range(p-1):
n_good = n_balls[ii]
n_bad = n_bad - n_good
rvs_ii = rvs[:,ii]
mask = n_remain = 1
need = mask.sum()
rvs_ii[mask] = np.random.hypergeometric(n_good, n_bad, n_remain[mask], 
size=need)
rvs[:,ii] = rvs_ii
n_remain = np.maximum(n_remain - rvs_ii, 0) 

rvs[:, -1] = n_sample - rvs.sum(1)
#print rvs
print rvs.mean(0) * n_all / n_sample
u, idx = np.unique(rvs.view([('', int)]*3), return_inverse=True)
u_arr = u.view(int).reshape(len(u), -1)
count = np.bincount(idx) 
freq = count * 1. / len(idx)

from scipy.misc import comb

def pmf(x, n_balls):
x = np.asarray(x)
n_balls = np.asarray(n_balls)
#p = len(n_balls)
ret = np.product(comb(n_balls, x)) / comb(n_balls.sum(), x.sum())
return ret

print
print freq
for x,fr in zip(u_arr, freq):
th = pmf(x, n_balls)
print x, np.round(th, 5),  fr, np.round(fr - th, 10)


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multivariate hypergeometric distribution?

2012-07-02 Thread Skipper Seabold
On Mon, Jul 2, 2012 at 9:35 PM, josef.p...@gmail.com wrote:

 On Mon, Jul 2, 2012 at 8:08 PM,  josef.p...@gmail.com wrote:
  On Mon, Jul 2, 2012 at 4:16 PM, Fernando Perez fperez@gmail.com
 wrote:
  Hi all,
 
  in recent work with a colleague, the need came up for a multivariate
  hypergeometric sampler; I had a look in the numpy code and saw we have
  the bivariate version, but not the multivariate one.
 
  I had a look at the code in scipy.stats.distributions, and it doesn't
  look too difficult to add a proper multivariate hypergeometric by
  extending the bivariate code, with one important caveat: the hard part
  is the implementation of the actual discrete hypergeometric sampler,
  which lives inside of numpy/random/mtrand/distributions.c:
 
 
 https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c#L743
 
  That code is hand-written C, and it only works for the bivariate case
  right now.  It doesn't look terribly difficult to extend, but it will
  certainly take a bit of care and testing to ensure all edge cases are
  handled correctly.
 
  My only foray into this
 
  http://projects.scipy.org/numpy/ticket/921
  http://projects.scipy.org/numpy/ticket/923
 
  This looks difficult to add without a good reference and clear
  description of the algorithm.
 
 
  Does anyone happen to have that implemented lying around, in a form
  that would be easy to merge to add this capability to numpy?
 
  not me, I have never even heard of multivariate hypergeometric
 distribution.
 
 
  maybe http://hal.inria.fr/docs/00/11/00/56/PDF/perm.pdf  p.11
  with some properties
 http://www.math.uah.edu/stat/urn/MultiHypergeometric.html
 
  I've seen one other algorithm, that seems to need N (number of draws
  in hypergeom) random variables for one multivariate hypergeometric
  random draw, which seems slow to me.
 
  But maybe someone has it lying around.

 Now I have a pure num/sci/python version around.

 A bit more than an hour, so no guarantees, but freq and pmf look close
 enough.


I could be wrong, but I think PyMC has sampling and likelihood.

Skipper
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multivariate hypergeometric distribution?

2012-07-02 Thread Fernando Perez
On Mon, Jul 2, 2012 at 7:31 PM, Skipper Seabold jsseab...@gmail.com wrote:
 I could be wrong, but I think PyMC has sampling and likelihood.

It appears you're right!

http://pymc-devs.github.com/pymc/distributions.html?highlight=hypergeometric#pymc.distributions.multivariate_hypergeometric_like

Thanks :)

Cheers,

f
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multivariate hypergeometric distribution?

2012-07-02 Thread Fernando Perez
On Mon, Jul 2, 2012 at 7:49 PM, Fernando Perez fperez@gmail.com wrote:
 It appears you're right!

 http://pymc-devs.github.com/pymc/distributions.html?highlight=hypergeometric#pymc.distributions.multivariate_hypergeometric_like

Furthermore, the code actually calls a sampler implemented in Fortran:

http://pymc-devs.github.com/pymc/_modules/pymc/distributions.html#multivariate_hypergeometric_like

which calls this:

https://github.com/pymc-devs/pymc/blob/master/pymc/flib.f#L4379

Thanks again to both of you for the help.  It's always productive to
ask around here ;)

Cheers,

f
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multivariate hypergeometric distribution?

2012-07-02 Thread josef . pktd
On Mon, Jul 2, 2012 at 10:53 PM, Fernando Perez fperez@gmail.com wrote:
 On Mon, Jul 2, 2012 at 7:49 PM, Fernando Perez fperez@gmail.com wrote:
 It appears you're right!

nice idea:

https://github.com/pymc-devs/pymc/blob/master/pymc/distributions.py#L1670



 http://pymc-devs.github.com/pymc/distributions.html?highlight=hypergeometric#pymc.distributions.multivariate_hypergeometric_like

 Furthermore, the code actually calls a sampler implemented in Fortran:

 http://pymc-devs.github.com/pymc/_modules/pymc/distributions.html#multivariate_hypergeometric_like

 which calls this:

 https://github.com/pymc-devs/pymc/blob/master/pymc/flib.f#L4379

 Thanks again to both of you for the help.  It's always productive to
 ask around here ;)

the loglikelihood function is just some calls to scipy.special.gammaln

I changed my version to get better numerical precision

#taken from scipy.misc but returns log of comb
def log_comb(n, k):
from scipy import special
from scipy.special import gammaln
k = np.asarray(k)
n = np.asarray(n)
cond = (k = n)  (n = 0)  (k = 0)
sv = special.errprint(0)
vals = gammaln(n + 1) - gammaln(n - k + 1) - gammaln(k + 1)
sv = special.errprint(sv)
return np.where(cond, vals, -np.inf)

def log_pmf(x, n_balls):
x = np.asarray(x)
n_balls = np.asarray(n_balls)
ret = np.sum(log_comb(n_balls, x)) - log_comb(n_balls.sum(), x.sum())
return ret

def pmf(x, n_balls):
x = np.asarray(x)
n_balls = np.asarray(n_balls)
ret = np.sum(log_comb(n_balls, x)) - log_comb(n_balls.sum(), x.sum())
ret = np.exp(ret)
return ret


proof by picture
https://picasaweb.google.com/106983885143680349926/Joepy#5760777856741667746
https://picasaweb.google.com/106983885143680349926/Joepy#5760777861891130962

Cheers,

Josef


 Cheers,

 f
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion