On Mon, Jan 23, 2017 at 4:52 PM, aleba...@gmail.com <aleba...@gmail.com> wrote:
>
> 2017-01-23 15:33 GMT+01:00 Robert Kern <robert.k...@gmail.com>:
>
>> I don't object to some Notes, but I would probably phrase it more like we
>> are providing the standard definition of the jargon term "sampling without
>> replacement" in the case of non-uniform probabilities. To my mind (or more
>> accurately, with my background), "replace=False" obviously picks out the
>> implemented procedure, and I would have been incredibly surprised if it did
>> anything else. If the option were named "unique=True", then I would have
>> needed some more documentation to let me know exactly how it was
>> implemented.
>>
>> FWIW, I totally agree with Robert
>

With my own background (MSc. in Mathematics), I agree that this algorithm is indeed the most natural one. And as I said, when I wanted to choose random combinations (k out of n items) myself, I wrote exactly the same one. But when it didn't produce the desired probabilities (even in cases where I knew that this was possible), I wrongly assumed numpy would do things differently - only to realize it uses exactly the same algorithm. So clearly, the documentation didn't quite explain what it does or doesn't do.

Also, Robert, I'm curious: beyond explaining why the existing algorithm is reasonable (which I agree it is), could you give me an example of where it is actually *useful* for sampling?

Let me give you an illustrative counter-example:

Let's imagine a country that has 3 races: 40% Lilliputians, 40% Blefuscans, and 20% Yahoos (immigrants from a different section of the book ;-)). Gulliver wants to take a poll, and needs to sample people from all these races in the appropriate proportions. These races live in different parts of town, so to pick a random person he needs to first pick one of the races and then a random person from that part of town.
If he picks one respondent at a time, he uses numpy.random.choice(3, size=1, p=[0.4,0.4,0.2]) to pick the part of town, and then a person from that part - he gets the desired 40% / 40% / 20% division of races.

Now imagine that Gulliver can interview two respondents each day, so he needs to pick two people each time. If he picks 2 choices of part-of-town *with* replacement, numpy.random.choice(3, size=2, p=[0.4,0.4,0.2]), that's also fine: he may need to take two people from the same part of town, or two from two different parts of town, but in any case he will still get the desired 40% / 40% / 20% division between the races of the people he interviews.

But suppose we are told that if two people from the same race meet in Gulliver's interview room, the two start chatting between themselves, and waste Gulliver's time. So he prefers to interview two people of *different* races. That's sampling without replacement. So he uses numpy.random.choice(3, size=2, p=[0.4,0.4,0.2], replace=False) to pick two different parts of town, and one person from each.

But then he looks at his logs, and discovers he actually interviewed the races at 38% / 38% / 23% proportions - not the 40% / 40% / 20% he wanted. So the opinions of the Yahoos were over-counted in this poll!

I know that this is a silly example (made even sillier by the names of races I used), but I wonder if you could give me an example where the current behavior of replace=False is genuinely useful.

Not that I'm saying that fixing this problem is easy (I'm still struggling with it myself in the general case of size < n-1).

Nadav.
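
The 38% / 38% / 23% figures in the Gulliver example can be checked exactly with a short sketch. It assumes that replace=False behaves like drawing one item at a time and renormalizing the remaining weights after each draw, which is the procedure this thread describes as the natural one. The function name interview_shares is hypothetical, and the code is pure Python rather than a call into numpy.

```python
from fractions import Fraction
from itertools import permutations


def interview_shares(p, k):
    """Exact share of interviews per category when k distinct categories
    are drawn one at a time, renormalizing the remaining weights.

    Assumption: this models sequential weighted sampling without
    replacement; it does not call numpy itself.
    """
    n = len(p)
    inclusion = [Fraction(0)] * n
    # Enumerate every ordered draw of k distinct categories.
    for order in permutations(range(n), k):
        prob = Fraction(1)
        remaining = Fraction(1)  # total weight still in the pool
        for i in order:
            prob *= p[i] / remaining
            remaining -= p[i]
        # Every category in this draw gets interviewed once.
        for i in order:
            inclusion[i] += prob
    # Each day yields k interviews, so divide by k to get shares.
    return [x / k for x in inclusion]


# Lilliputians, Blefuscans, Yahoos: 40% / 40% / 20%
p = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
shares = interview_shares(p, 2)
print([str(s) for s in shares])        # ['23/60', '23/60', '7/30']
print([round(float(s), 3) for s in shares])  # [0.383, 0.383, 0.233]
```

The exact shares are 23/60, 23/60 and 7/30, i.e. roughly 38.3% / 38.3% / 23.3%, so under this assumption the Yahoos end up with more than their 20% of the interviews.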
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion