On Mon, Jan 23, 2017 at 6:27 AM, Anne Archibald <peridot.face...@gmail.com> wrote:
>
> On Wed, Jan 18, 2017 at 4:13 PM Nadav Har'El <n...@scylladb.com> wrote:
>>
>> On Wed, Jan 18, 2017 at 4:30 PM, <josef.p...@gmail.com> wrote:
>>>
>>>> Having more sampling schemes would be useful, but it's not possible to implement sampling schemes with impossible properties.
>>>
>>> BTW: sampling 3 out of 3 without replacement is even worse
>>>
>>> No matter what sampling scheme and what selection probabilities we use, we always have every element with probability 1 in the sample.
>>
>> I agree. The random-sample function of the type I envisioned will be able to reproduce the desired probabilities in some cases (like the example I gave) but not in others. Because doing this correctly involves a set of n linear equations in comb(n,k) variables, it can have no solution, or many solutions, depending on the n and k, and the desired probabilities. A function of this sort could return an error if it can't achieve the desired probabilities.
>
> It seems to me that the basic problem here is that the numpy.random.choice docstring fails to explain what the function actually does when called with weights and without replacement. Clearly there are different expectations; I think numpy.random.choice chose one that is easy to explain and implement but not necessarily what everyone expects. So the docstring should be clarified. Perhaps a Notes section:
>
> When numpy.random.choice is called with replace=False and non-uniform probabilities, the resulting distribution of samples is not obvious. numpy.random.choice effectively follows the procedure: when choosing the kth element in a set, the probability of element i occurring is p[i] divided by the total probability of all not-yet-chosen (and therefore eligible) elements. This approach is always possible as long as the sample size is no larger than the population, but it means that the probability that element i occurs in the sample is not exactly p[i].
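To make the procedure in the quoted note concrete, here is a minimal sketch that computes the exact probability of each element ending up in the sample. It only follows the description above (each draw picks element i with probability p[i] divided by the total probability of the not-yet-chosen elements); it is not taken from the numpy source, and the function name inclusion_probabilities is made up for this illustration.

```python
from fractions import Fraction
from itertools import permutations


def inclusion_probabilities(p, k):
    """Exact probability that each element appears in a size-k sample.

    Hypothetical helper: models sequential draws without replacement,
    where each draw picks element i with probability
    p[i] / (total probability of the elements not yet chosen).
    Pass p as Fractions (or ints) to keep the arithmetic exact.
    """
    p = [Fraction(x) for x in p]
    n = len(p)
    incl = [Fraction(0)] * n
    # Enumerate every ordered sample of k distinct indices.
    for order in permutations(range(n), k):
        prob = Fraction(1)
        remaining = sum(p)  # probability mass still eligible
        for i in order:
            prob *= p[i] / remaining
            remaining -= p[i]
        # Every element in this ordered sample is "in the sample".
        for i in order:
            incl[i] += prob
    return incl


p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
print(inclusion_probabilities(p, 2))
print(inclusion_probabilities(p, 3))
```

For p = (1/2, 1/4, 1/4) and a sample of size 2, this gives inclusion probabilities 5/6, 7/12 and 7/12, which are neither p[i] nor 2*p[i]. For a sample of size 3, every element is included with probability 1, as noted earlier in the thread. The enumeration is exponential in k, so this is only useful for checking small cases by hand.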
I don't object to some Notes, but I would probably phrase it more like we are providing the standard definition of the jargon term "sampling without replacement" in the case of non-uniform probabilities. To my mind (or more accurately, with my background), "replace=False" obviously picks out the implemented procedure, and I would have been incredibly surprised if it did anything else. If the option were named "unique=True", then I would have needed some more documentation to let me know exactly how it was implemented.

--
Robert Kern
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion