On Wed, Jan 18, 2017 at 4:13 PM Nadav Har'El <n...@scylladb.com> wrote:

> On Wed, Jan 18, 2017 at 4:30 PM, <josef.p...@gmail.com> wrote:
> > Having more sampling schemes would be useful, but it's not possible to
> > implement sampling schemes with impossible properties.
> >
> > BTW: sampling 3 out of 3 without replacement is even worse
> > No matter what sampling scheme and what selection probabilities we use, we
> > always have every element with probability 1 in the sample.
>
> I agree. The random-sample function of the type I envisioned will be able
> to reproduce the desired probabilities in some cases (like the example I
> gave) but not in others. Because doing this correctly involves a set of n
> linear equations in comb(n,k) variables, it can have no solution, or many
> solutions, depending on the n and k, and the desired probabilities. A
> function of this sort could return an error if it can't achieve the desired
> probabilities.

It seems to me that the basic problem here is that the numpy.random.choice
docstring fails to explain what the function actually does when called with
weights and without replacement. Clearly there are different expectations;
I think numpy.random.choice chose one that is easy to explain and implement
but not necessarily what everyone expects. So the docstring should be
clarified. Perhaps a Notes section:

When numpy.random.choice is called with replace=False and non-uniform
probabilities, the resulting distribution of samples is not obvious.
numpy.random.choice effectively follows this procedure: when choosing the
kth element of the sample, the probability of selecting element i is p[i]
divided by the total probability of all not-yet-chosen (and therefore
eligible) elements. This approach is always possible as long as the sample
size is no larger than the number of elements with nonzero probability, but
it means that the probability that element i occurs in the sample is, in
general, not proportional to p[i].
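
To make the procedure above concrete, here is a rough pure-Python sketch
of it. This is not numpy's actual implementation; the function names
sequential_weighted_sample and inclusion_probabilities are made up for
illustration, and the code assumes every p[i] is strictly positive.

```python
import random
from itertools import permutations


def sequential_weighted_sample(p, size, rng):
    """Pick `size` distinct indices one at a time (hypothetical helper).

    At each step, index i is chosen with probability p[i] divided by the
    total weight of the indices that have not been chosen yet.
    """
    remaining = list(range(len(p)))
    chosen = []
    for _ in range(size):
        weights = [p[i] for i in remaining]
        # random.choices normalizes the weights internally.
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


def inclusion_probabilities(p, size):
    """Exact probability that each index ends up in the sample.

    Enumerates every ordered draw of `size` distinct indices and sums the
    probability of each one. Assumes all p[i] > 0 (illustration only; the
    cost grows very quickly with len(p)).
    """
    incl = [0.0] * len(p)
    for order in permutations(range(len(p)), size):
        prob = 1.0
        left = sum(p)  # total weight of the still-eligible indices
        for i in order:
            prob *= p[i] / left
            left -= p[i]
        for i in order:
            incl[i] += prob
    return incl


if __name__ == "__main__":
    p = [0.2, 0.4, 0.4]
    print(sequential_weighted_sample(p, 2, random.Random(0)))
    print(inclusion_probabilities(p, 2))
```

For p = [0.2, 0.4, 0.4] and a sample of size 2, the enumeration gives
inclusion probabilities of about 0.467, 0.767 and 0.767, whereas
probabilities proportional to p would be 0.4, 0.8 and 0.8.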


NumPy-Discussion mailing list
