On Mon, Jan 23, 2017 at 6:27 AM, Anne Archibald <peridot.face...@gmail.com>
wrote:
>
> On Wed, Jan 18, 2017 at 4:13 PM Nadav Har'El <n...@scylladb.com> wrote:
>>
>> On Wed, Jan 18, 2017 at 4:30 PM, <josef.p...@gmail.com> wrote:
>>>
>>>> Having more sampling schemes would be useful, but it's not possible to
>>>> implement sampling schemes with impossible properties.
>>>
>>> BTW: sampling 3 out of 3 without replacement is even worse
>>>
>>> No matter what sampling scheme and what selection probabilities we use,
>>> we always have every element with probability 1 in the sample.
>>
>> I agree. The random-sample function of the type I envisioned will be
>> able to reproduce the desired probabilities in some cases (like the example
>> I gave) but not in others. Because doing this correctly involves a set of n
>> linear equations in comb(n,k) variables, it can have no solution, or many
>> solutions, depending on the n and k, and the desired probabilities. A
>> function of this sort could return an error if it can't achieve the desired
>> probabilities.
>
> It seems to me that the basic problem here is that the
> numpy.random.choice docstring fails to explain what the function actually
> does when called with weights and without replacement. Clearly there are
> different expectations; I think numpy.random.choice chose one that is easy
> to explain and implement but not necessarily what everyone expects. So the
> docstring should be clarified. Perhaps a Notes section:
>
> When numpy.random.choice is called with replace=False and non-uniform
> probabilities, the resulting distribution of samples is not obvious.
> numpy.random.choice effectively follows the procedure: when choosing the
> kth element in a set, the probability of element i occurring is p[i]
> divided by the total probability of all not-yet-chosen (and therefore
> eligible) elements. This approach is always possible as long as the sample
> size is no larger than the population, but it means that the probability
> that element i occurs in the sample is not exactly p[i].

I don't object to some Notes, but I would probably phrase it as providing
the standard definition of the jargon term "sampling without replacement"
in the case of non-uniform probabilities. To my mind (or more
accurately, with my background), "replace=False" obviously picks out the
implemented procedure, and I would have been incredibly surprised if it did
anything else. If the option were named "unique=True", then I would have
needed some more documentation to let me know exactly how it was
implemented.
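
To make the procedure in the quoted note concrete, here is a minimal sketch
that enumerates every ordered sample and accumulates its probability under
the "renormalize over the not-yet-chosen elements" rule. It is an
illustration only, not numpy.random.choice's actual implementation; the
helper name inclusion_probabilities and the example weights are made up for
this sketch, and it assumes every p[i] is strictly positive.

```python
from itertools import permutations


def inclusion_probabilities(p, k):
    """Exact probability that each element appears in a size-k sample.

    Hypothetical helper: models sequential draws without replacement,
    where each draw picks element i with probability p[i] divided by
    the total weight of the elements not chosen so far.
    Assumes all p[i] > 0 and k <= len(p).
    """
    n = len(p)
    incl = [0.0] * n
    # Enumerate every ordered sample of k distinct elements.
    for order in permutations(range(n), k):
        prob = 1.0
        remaining = sum(p)  # total weight still eligible
        for i in order:
            prob *= p[i] / remaining
            remaining -= p[i]
        # Every element in this ordered sample is "in the sample".
        for i in order:
            incl[i] += prob
    return incl


p = [0.5, 0.3, 0.2]  # example weights, chosen arbitrarily
for k in (1, 2, 3):
    print(k, inclusion_probabilities(p, k))
```

For k=1 the inclusion probabilities are just p. For k=2 they sum to 2 but
are not proportional to p, and for k=3 (sampling 3 out of 3) every element
is included with probability 1 whatever the weights are, which is the point
made earlier in the thread.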

--
Robert Kern