----- Forwarded Message ----- From: Lee Rhodes <[email protected]>To: Eric Bax <[email protected]>Sent: Friday, October 18, 2019, 03:38:37 PM PDTSubject: Re: Sampling for sketches Eric, this is great. But in the spirit of openness, would you mind moving this discussion to either [email protected] or to our Apache slack workspace: the-asf.slack.com / channel: datasketches-dev. Thanks, Lee.
On Fri, Oct 18, 2019 at 3:23 PM Eric Bax <[email protected]> wrote: Thanks for the links. I see that it really is like selecting a sample of a specified size. In that case, I think hypergeometric bounds apply. Letn = population size.m = number of "successes" in population (e.g. number of people who visited the site 15 days last month.)s = sample size (e.g. number of users in our little sketch sample)k = number of "successes" in sample. Then the probability of seeing k / s in our sample, given m / n at large, is hypergeometric: p(n, m, s, k) = C(m, k) C(n - m, s - k) / C(n, s). So an upper bound for m, with probability of bound failure delta, is: max {m | sum(i = 0 to k) p(n, m, s, i) >= delta}, the maximum m-value that has k /s or less sample successes with probability at least delta. A lower bound is: min {m | sum(i = k to s) p(n, m, s, i) >= delta}. I'll attach some python code that computes these bounds. On Thursday, October 17, 2019, 05:28:27 PM PDT, Lee Rhodes <[email protected]> wrote: Eric, It is more like the second case. Example 1: Reservoir sampling sketch. Choose k, then pick samples from the stream of n samples such that each sample in the stream has the same probability, k/n, of ending up in the reservoir. However, if there are duplicates in the stream, you can have duplicates in the reservoir. Example 2: "Sampling" distinct values using a Theta (or KMV) Sketch (which retains sample proxies). - Remove all duplicates - Use Reservoir sampling to select your samples. This is not HOW the sketches work, but from a probability point-of-view, it should be the same. There is a very rudimentary explanation of how Theta (and KMV) sketches work on-line at https://datasketches.github.io/docs/Theta/InverseEstimate.html The paper for Theta sketches is https://github.com/DataSketches/DataSketches.github.io/blob/master/docs/pdf/ThetaSketchFramework.pdf HLL sketches are another animal altogether and don't retain samples. Lee. On Thu, Oct 17, 2019 at 4:36 PM Eric Bax <[email protected]> wrote: Is the sampling like: For each user in the population, I randomly determine whether to include them in the sample with a fixed probability? Rather than: I decide how many samples to take. Then I select that many users at random from the population? I think the former. Just want to make sure for thinking about whether samples of categories are independent. In other words union bounds would not be needed for multiple categories. Sent from Yahoo Mail for iPhone
