On Tue, Jan 17, 2017 at 6:58 PM, aleba...@gmail.com <aleba...@gmail.com> wrote:
> 2017-01-17 22:13 GMT+01:00 Nadav Har'El <n...@scylladb.com>:
>
>> On Tue, Jan 17, 2017 at 7:18 PM, aleba...@gmail.com <aleba...@gmail.com> wrote:
>>
>>> Hi Nadav,
>>>
>>> I may be wrong, but I think that the result of the current
>>> implementation is actually the expected one.
>>> Using your example: the probabilities for items 1, 2 and 3 are 0.2, 0.4
>>> and 0.4.
>>>
>>> P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2])
>>
>> Yes, this formula does fit well with the actual algorithm in the code.
>> But my question is *why* we want this formula to be correct:
>
> Just a note: this formula is correct, and it follows from fundamental laws
> of probability:
> https://en.wikipedia.org/wiki/Law_of_total_probability and
> https://en.wikipedia.org/wiki/Bayes%27_theorem
> Thus, the result we get from random.choice IMHO definitely makes sense. Of
> course, I think we could always discuss implementing other sampling
> methods if they are useful to some application.
>
>>> Now, P([1]) = 0.2 and P([2]) = 0.4. However:
>>> P([2] | 1st=[1]) = 0.5 (2 and 3 have the same sampling probability)
>>> P([1] | 1st=[2]) = 1/3 (1 and 3 have probabilities 0.2 and 0.4 which,
>>> once normalised, translate into 1/3 and 2/3 respectively)
>>> Therefore P([1,2]) = 0.7/3 = 0.23333
>>> Similarly, P([1,3]) = 0.23333 and P([2,3]) = 1.6/3 = 0.53333
>>
>> Right, these are the numbers that the algorithm in the current code, and
>> the formula above, produce:
>>
>> P([1,2]) = P([1,3]) = 0.23333
>> P([2,3]) = 0.53333
>>
>> What I'm puzzled about is that these probabilities do not really fulfill
>> the given probability vector 0.2, 0.4, 0.4...
>> Let me try to explain:
>>
>> Why did the user choose the probabilities 0.2, 0.4, 0.4 for the three
>> items in the first place?
>>
>> One reasonable interpretation is that the user wants, in his random
>> picks, to see item 1 half as often as item 2 or 3.
>> For example, maybe item 1 costs twice as much as item 2 or 3, so picking
>> it half as often will result in an equal expenditure on each item.
>>
>> If the user randomly picks the items individually (a single item at a
>> time), he indeed gets exactly this distribution: 0.2 of the time item 1,
>> 0.4 of the time item 2, 0.4 of the time item 3.
>>
>> Now, what happens if he picks not individual items but pairs of
>> different items, using numpy.random.choice with two items and
>> replace=False? Suddenly, the distribution of the individual items in the
>> results gets skewed: if we look at the expected number of times we'll
>> see each item in one draw of a random pair, we get:
>>
>> E(1) = P([1,2]) + P([1,3]) = 0.46666
>> E(2) = P([1,2]) + P([2,3]) = 0.76666
>> E(3) = P([1,3]) + P([2,3]) = 0.76666
>>
>> Or, renormalizing by dividing by 2:
>>
>> P(1) = 0.23333
>> P(2) = 0.38333
>> P(3) = 0.38333
>>
>> As you can see, these are not quite the probabilities we wanted (which
>> were 0.2, 0.4, 0.4)! In the random pairs we picked, item 1 appeared a
>> bit more often than we wanted, and items 2 and 3 a bit less often!
>
> p is not the probability of the output but that of the source finite
> population. I think that if you want to preserve that distribution, as
> Josef pointed out, you have to make the extractions independent, that is,
> either sample with replacement or approximate an infinite population
> (which is basically the same thing). But of course in this case you will
> also end up with events [X,X].

With replacement and keeping duplicates, the results might also be similar
in the pattern of the marginal probabilities:
https://onlinecourses.science.psu.edu/stat506/node/17

Another approach in survey sampling is to drop duplicates in
with-replacement sampling, but then the sample size itself is random.
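The numbers above are easy to verify: the sequential renormalization scheme gives the 0.23333/0.53333 pair probabilities analytically, and a quick simulation with numpy.random.choice shows the same skewed marginals. A sketch (items numbered 0..2 here, standing for the thread's items 1..3; the seed and trial count are arbitrary choices):

```python
from itertools import permutations

import numpy as np

# Analytic pair probabilities under the sequential scheme: pick the first
# item from p, then the second from the remaining items with renormalized
# weights.
p = {0: 0.2, 1: 0.4, 2: 0.4}
pair_prob = {}
for a, b in permutations(p, 2):
    key = frozenset((a, b))
    # P(first=a) * P(second=b | first=a)
    pair_prob[key] = pair_prob.get(key, 0.0) + p[a] * p[b] / (1 - p[a])

print({tuple(sorted(k)): round(v, 5) for k, v in pair_prob.items()})
# {(0, 1): 0.23333, (0, 2): 0.23333, (1, 2): 0.53333}

# Empirical per-item frequencies when drawing pairs with replace=False.
np.random.seed(12345)
trials = 100_000
counts = np.zeros(3)
for _ in range(trials):
    counts[np.random.choice(3, size=2, replace=False, p=[0.2, 0.4, 0.4])] += 1
freq = counts / (2 * trials)
print(freq)  # roughly [0.233, 0.383, 0.383] -- not the requested [0.2, 0.4, 0.4]
```

The simulated frequencies match the renormalized marginals 0.23333, 0.38333, 0.38333 up to sampling noise.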
(Again, I didn't try to understand the small print.)

(Another related aside: the problem with a discrete sample space in small
samples also shows up when calculating hypothesis tests, e.g. Fisher's
exact test or similar. Because we only get a few discrete possibilities in
the sample space, it is not possible to construct a test that has exactly
the desired type 1 error.)

Josef

>> So that brought my question of why we consider these numbers right.
>>
>> In this example, it's actually possible to get the right item
>> distribution, if we pick the pair outcomes with the following
>> probabilities:
>>
>> P([1,2]) = 0.2 (not 0.23333 as above)
>> P([1,3]) = 0.2
>> P([2,3]) = 0.6 (not 0.53333 as above)
>>
>> Then we get exactly the right P(1), P(2), P(3): 0.2, 0.4, 0.4.
>>
>> Interestingly, fixing things as I suggest is not always possible.
>> Consider a different probability vector for the three items: 0.99,
>> 0.005, 0.005. Now, no matter which algorithm we use for randomly picking
>> pairs from these three items, *each* returned pair will inevitably
>> contain one of the two very-low-probability items, so each of those
>> items will appear in roughly half the pairs, instead of in a vanishingly
>> small percentage as we hoped.
>>
>> But for other choices of probabilities (like the one in my original
>> example), there is a solution. For 2-out-of-3 sampling we can actually
>> write down a system of three linear equations in three variables, so
>> there is always one solution; but if that solution has components that
>> are not valid as probabilities (not in [0,1]), we end up with no
>> solution - as happens in the 0.99, 0.005, 0.005 example.
>>
>>> What am I missing?
>>>
>>> Alessandro
>>>
>>> 2017-01-17 13:00 GMT+01:00 <numpy-discussion-requ...@scipy.org>:
>>>
>>>> Hi, I'm looking for a way to find a random sample of C different items
>>>> out of N items, with some desired probability Pi for each item i.
>>>> I saw that numpy has a function that supposedly does this,
>>>> numpy.random.choice (with replace=False and a probabilities array), but
>>>> looking at the algorithm actually implemented, I am wondering in what
>>>> sense the probabilities Pi are actually obeyed...
>>>>
>>>> To me, the code doesn't seem to be doing the right thing... Let me
>>>> explain:
>>>>
>>>> Consider a simple numerical example: we have 3 items, and need to pick
>>>> 2 different ones randomly. Let's assume the desired probabilities for
>>>> items 1, 2 and 3 are 0.2, 0.4 and 0.4.
>>>>
>>>> Working out the equations, there is exactly one solution here: the
>>>> random outcome of numpy.random.choice in this case should be [1,2]
>>>> with probability 0.2, [1,3] with probability 0.2, and [2,3] with
>>>> probability 0.6. That is indeed a solution for the desired
>>>> probabilities, because it yields item 1 in [1,2]+[1,3] = 0.2 + 0.2 =
>>>> 2*P1 of the trials, item 2 in [1,2]+[2,3] = 0.2+0.6 = 0.8 = 2*P2 of
>>>> the trials, etc.
>>>>
>>>> However, the algorithm in numpy.random.choice's replace=False
>>>> generates, if I understand correctly, different probabilities for the
>>>> outcomes: I believe in this case it generates [1,2] with probability
>>>> 0.23333, [1,3] also 0.23333, and [2,3] with probability 0.53333.
>>>>
>>>> My question is: how does this result fit the desired probabilities?
>>>>
>>>> If we get [1,2] with probability 0.23333 and [1,3] with probability
>>>> 0.23333, then the expected number of "1" results we'll get per drawing
>>>> is 0.23333 + 0.23333 = 0.46666, and similarly for "2" the expected
>>>> number is 0.76666, and for "3" 0.76666. As you can see, the
>>>> proportions are off: item 2 is NOT twice as common as item 1, as we
>>>> originally desired (we asked for probabilities 0.2, 0.4, 0.4 for the
>>>> individual items!).
>>>>
>>>> --
>>>> Nadav Har'El
>>>> n...@scylladb.com
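The system of three linear equations mentioned above can be written out and solved directly. A sketch, with the unknowns ordered as [q12, q13, q23] (the probabilities of the pairs [1,2], [1,3], [2,3]):

```python
import numpy as np

# Each item i should appear in a fraction 2*p_i of the drawn pairs:
#   q12 + q13 = 2*p1,  q12 + q23 = 2*p2,  q13 + q23 = 2*p3
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

for p in ([0.2, 0.4, 0.4], [0.99, 0.005, 0.005]):
    q = np.linalg.solve(A, 2 * np.asarray(p))
    print(p, "->", np.round(q, 5))
# The first case yields the valid solution [0.2, 0.2, 0.6]; the second
# yields [0.99, 0.99, -0.98], whose negative component means no valid
# pair distribution exists for 0.99, 0.005, 0.005.
```

Since A is invertible, the system always has exactly one solution, and the probability vector is "fixable" precisely when that solution lies in [0,1]^3.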
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion