Dean Rasheed <dean.a.rash...@gmail.com> writes: > I think we should attempt to come up with a more principled approach > to this, taking into account the table and sample sizes. Here's what I > found, after a bit of research:

Thanks for doing some legwork on this! > A common initial rule of thumb is that the value should occur at least > 10 times in the sample - see, for example [1], [2]. ... > Note that this says nothing about the margin of error of p, just that > it is reasonable to treat it as having a normal distribution, which > then allows the margin of error to be analysed using standard > techniques. Got it. Still, insisting on >= 10 occurrences feels a lot better to me than insisting on >= 2. It's interesting that that can be correlated to whether the margin of error is easily analyzed. > The standard way of doing this is to calculate the "standard error" of > the sample proportion - see, for example [3], [4]: > SE = sqrt(p*(1-p)/n) > Note, however, that this formula assumes that the sample size n is > small compared to the population size N, which is not necessarily the > case. This can be taken into account by applying the "finite > population correction" (see, for example [5]), which involves > multiplying by an additional factor: > SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1)) It's been a long time since college statistics, but that wikipedia article reminds me that the binomial distribution isn't really the right thing for our problem anyway. We're doing sampling without replacement, so that the correct model is the hypergeometric distribution. The article points out that the binomial distribution is a good approximation as long as n << N. Can this FPC factor be justified as converting binomial estimates into hypergeometric ones, or is it ad hoc? > This gives rise to the possibility of writing another rule for candidate MCVs: > If there are Nd distinct values in the table, so that the average > frequency of occurrence of any particular value is 1/Nd, then the test > pmin > 1/Nd Interesting indeed. We'd have to be working with our estimated Nd, of course, not the true Nd. We know empirically that the estimate usually lowballs the true value unless our sample is a fairly large fraction of the whole table. That would tend to make our cutoff pmin too high, so that we'd be excluding some candidate MCVs even though a better sample would show they almost certainly had a frequency more common than average. That behavior seems fine to me; it'd tend to drop questionable MCVs in small samples, which is exactly what I think we should be doing. But it is probably worth doing some empirical experiments with this rule to see what we get. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers