On 22 January 2018 at 08:07, John Naylor <jcnay...@gmail.com> wrote:
> On 1/21/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
>> It occurs to me that maybe a better test to exclude a value from the
>> MCV list would be to demand that its relative standard error not be
>> too high. Such a test, in addition to the existing tests, might be
>> sufficient to solve the opposite problem of too many values in the
>> MCV list, because the real problem there is including a value after
>> having seen relatively few occurrences of it in the sample, and thus
>> having a wildly inaccurate estimate for it. Setting a bound on the
>> relative standard error would mean that we could have a reasonable
>> degree of confidence in estimates produced from the sample.
>
> If you don't mind, what would the math look like for that?
Using the same syntax as before:

  N = Total rows in table (population size)
  n = Number of rows sampled
  x = Number of times a particular value appears in the sample
  p = x/n = Frequency of the value in the sample

the standard error of the proportion is

  SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

Then the relative standard error (which is usually expressed as a
percentage) is

  RSE = 100 * SE / p

So if we were to demand that the relative standard error was less than,
say, 10%, then the constraint would just be

  SE < 0.1 * p

Note:

* This formula is not valid if x is very small (see what I wrote about
  being able to approximate the distribution of p with a normal
  distribution). So we should also enforce the "rule of thumb" x >= 10.

* The frequency p that we're storing is the count divided by the overall
  sample size. So the values for N and n above should not be the counts
  that exclude NULLs. As far as this logic is concerned, NULLs (and
  too-wide values) are just values not equal to the value under
  consideration. Thus it appears, from just a quick glance at your patch,
  that you're using the wrong values for N and n.

* The RSE is a monotonically decreasing function of p in the range [0,1],
  so an upper bound on the RSE equates to a lower bound on the number of
  occurrences of the value.

This last point gives me a different idea. Instead of applying this test
*in addition to* the existing tests, how about applying it *instead of*
the existing tests? That is, we keep all MCVs that appear sufficiently
often that we can be reasonably confident in the estimates produced from
their sample frequencies, regardless of whether or not they are more
common than average (the average being fairly meaningless for a highly
skewed distribution anyway).

This is still keeping the most common values, but now we'd be saying that
we keep any value that appears sufficiently often in the sample that we
can extrapolate its sample frequency to produce a reasonably accurate
estimate of the population frequency, and discard values for which the
estimate is likely to be inaccurate.

I have not tested this idea at all, but it seems like it might be worth
considering. It has the nice property that the test depends entirely on
how often the value appears, rather than on other previously computed
statistics, such as Ndistinct.
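To make the test concrete, it amounts to something like the following
standalone C sketch (just an illustration -- not taken from, or tested
against, the actual analyze.c code; the function name and arguments here
are invented for this example):

/*
 * Keep a candidate MCV only if its sample frequency can be trusted:
 * it must appear at least 10 times, and the relative standard error of
 * p = x/n must be below 10%.
 *
 * N = total rows in the table, n = overall sample size (including NULLs
 * and too-wide values), x = number of occurrences of the value in the
 * sample.
 */
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

static bool
keep_mcv_candidate(double N, double n, double x)
{
    double  p = x / n;      /* sample frequency of the value */
    double  se;             /* standard error of p, with finite
                             * population correction */

    if (x < 10)             /* normal approximation no longer holds */
        return false;

    se = sqrt(p * (1.0 - p) / n) * sqrt((N - n) / (N - 1.0));

    return se < 0.1 * p;    /* relative standard error below 10% */
}

int
main(void)
{
    /* Boundary cases from the SQL test below (N = 10 million, n = 30000) */
    printf("x = 108: %s\n", keep_mcv_candidate(1e7, 30000, 108) ? "KEEP" : "DISCARD");
    printf("x = 46:  %s\n", keep_mcv_candidate(1e7, 30000, 46) ? "KEEP" : "DISCARD");
    return 0;
}

With N = 10 million and n = 30,000, the 10% RSE bound works out to
requiring roughly x >= 100 occurrences in the sample, which is where the
KEEP/DISCARD boundary falls in the test below.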
Doing a quick test in pure SQL, using the highly skewed distribution Jeff
Janes gave in the other thread, with the default sample size of 30,000:

with samples as (
  select floor(-log(random())/log(2))::int as who
    from generate_series(1,30000)
),
freqs as (
  select who, count(*) as x, count(*)/30000::float8 as p
    from samples
   group by who
),
stats as (
  select *,
         sqrt(p*(1-p)/30000) *
           sqrt((10000000-30000)::float8/(10000000-1)) as se
    from freqs
)
select *,
       (10000000*p)::int||'+/-'||(2*se*10000000)::int as "95% interval",
       case when x >= 10 and se < 0.1*p then 'KEEP' else 'DISCARD' end
  from stats
 order by p desc
 limit 100;

it pretty consistently keeps the 8 most common values:

 who |   x   |          p           |          se          |  95% interval   |  case
-----+-------+----------------------+----------------------+-----------------+---------
   0 | 15017 |    0.500566666666667 |  0.00288241625942075 | 5005667+/-57648 | KEEP
   1 |  7607 |    0.253566666666667 |  0.00250800597590887 | 2535667+/-50160 | KEEP
   2 |  3713 |    0.123766666666667 |     0.00189844800003 | 1237667+/-37969 | KEEP
   3 |  1855 |   0.0618333333333333 |   0.0013884757600711 | 618333+/-27770  | KEEP
   4 |   914 |   0.0304666666666667 | 0.000990788179299791 | 304667+/-19816  | KEEP
   5 |   448 |   0.0149333333333333 | 0.000699194759916533 | 149333+/-13984  | KEEP
   6 |   229 |  0.00763333333333333 | 0.000501741670388358 | 76333+/-10035   | KEEP
   7 |   108 |               0.0036 | 0.000345267009604061 | 36000+/-6905    | KEEP
   8 |    46 |  0.00153333333333333 | 0.000225565173744715 | 15333+/-4511    | DISCARD
   9 |    34 |  0.00113333333333333 | 0.000193963300230354 | 11333+/-3879    | DISCARD
  10 |    17 | 0.000566666666666667 | 0.000137191663419411 | 5667+/-2744     | DISCARD
  11 |    11 | 0.000366666666666667 | 0.000110367969704201 | 3667+/-2207     | DISCARD
  13 |     1 | 3.33333333333333e-05 | 3.32827427148958e-05 | 333+/-666       | DISCARD
(13 rows)

and if you increase the stats target to 1000 (sample size 300,000) then it
pretty consistently keeps the 11 most common values. So, in this very
limited test at least, it seems to help with Jeff's problem, and it's no
longer the case that cranking up the default statistics target has a weak
effect on the number of MCVs kept.

In the case of a uniform distribution, this might end up keeping all the
values seen, but that's not necessarily wrong -- if they are seen often
enough to make the MCV estimates accurate, why not keep them?

Of course this would need a lot more testing with other data
distributions, and I've not given any thought to the existing histogram
bounds test. Also the 10% RSE bound is fairly arbitrary, and could be
tweaked, but perhaps this is not an entirely crazy idea.

Regards,
Dean