On Mon, Jan 23, 2017 at 9:41 AM, Nadav Har'El <n...@scylladb.com> wrote:
>
> On Mon, Jan 23, 2017 at 4:52 PM, aleba...@gmail.com <aleba...@gmail.com> wrote:
>>
>> 2017-01-23 15:33 GMT+01:00 Robert Kern <robert.k...@gmail.com>:
>>>
>>> I don't object to some Notes, but I would probably phrase it more like we are providing the standard definition of the jargon term "sampling without replacement" in the case of non-uniform probabilities. To my mind (or more accurately, with my background), "replace=False" obviously picks out the implemented procedure, and I would have been incredibly surprised if it did anything else. If the option were named "unique=True", then I would have needed some more documentation to let me know exactly how it was implemented.
>>>
>> FWIW, I totally agree with Robert
>
> With my own background (MSc. in Mathematics), I agree that this algorithm is indeed the most natural one. And as I said, when I wanted to implement something myself when I wanted to choose random combinations (k out of n items), I wrote exactly the same one. But when it didn't produce the desired probabilities (even in cases where I knew that doing this was possible), I wrongly assumed numpy would do things differently - only to realize it uses exactly the same algorithm. So clearly, the documentation didn't quite explain what it does or doesn't do.
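For readers who have not met the term, the following is a minimal sketch of the sequential procedure being discussed: draw one item with probability proportional to its weight, remove it, renormalize over what is left, and repeat. It is written in pure Python rather than taken from numpy's source, and the names `sequential_sample` and `inclusion_probabilities` and the weights `[2, 1, 1]` are illustrative assumptions, not anything from the thread. The second function enumerates every ordered draw to get the exact chance that each item ends up in the sample, which is where the mismatch with "desired probabilities" shows up.

```python
from fractions import Fraction
from itertools import permutations
import random


def sequential_sample(weights, k, rng):
    """Draw k distinct indices: pick one in proportion to its weight,
    remove it, and repeat on the remaining items."""
    items = list(range(len(weights)))
    w = list(weights)
    chosen = []
    for _ in range(k):
        # random.Random.choices normalizes the weights of the remaining items.
        pos = rng.choices(range(len(items)), weights=w, k=1)[0]
        chosen.append(items.pop(pos))
        w.pop(pos)
    return chosen


def inclusion_probabilities(weights, k):
    """Exact probability that each index appears in a sample of size k,
    found by summing over every possible ordered draw."""
    w = [Fraction(x) for x in weights]
    incl = [Fraction(0)] * len(w)
    for order in permutations(range(len(w)), k):
        prob = Fraction(1)
        remaining = sum(w)
        for i in order:
            prob *= w[i] / remaining  # chance of drawing i from what is left
            remaining -= w[i]
        for i in order:
            incl[i] += prob
    return incl


if __name__ == "__main__":
    weights = [2, 1, 1]  # hypothetical example, i.e. p = [1/2, 1/4, 1/4]
    print(sequential_sample(weights, 2, random.Random(0)))
    print(inclusion_probabilities(weights, 2))
```

With these weights and k = 2, inclusion proportional to p would require probabilities 2p = [1, 1/2, 1/2]. The sequential procedure instead gives [5/6, 7/12, 7/12]: the heavy item is included less often than "proportional" would demand, and the light items more often.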

In my experience, I have seen "without replacement" mean only one thing. If the docstring had said "returns unique items", I'd agree that it doesn't explain what it does or doesn't do. The only issue is that "without replacement" is jargon, and it is good to recapitulate the definitions of such terms for those who aren't familiar with them.

> Also, Robert, I'm curious: beyond explaining why the existing algorithm is reasonable (which I agree), could you give me an example of where it is actually *useful* for sampling?

The references I previously quoted list a few. One is called "multistage sampling proportional to size". The idea is that you draw (without replacement) from larger units (say, congressional districts) before sampling within them. It is similar to the situation you outline, but it is probably more useful at a different scale, with lots of larger units (where your algorithm is likely to provide no solution) rather than a handful.

It is probably less useful in terms of survey design, where you are trying to *design* a process to get a result, than it is in queueing theory and related fields, where you are trying to *describe* and simulate a process that is pre-defined.

--
Robert Kern

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion