On 21 January 2018 at 07:26, John Naylor <jcnay...@gmail.com> wrote:
> I spent a few hours hacking on this, and it turns out calculating the
> right number of MCVs taking into account both uniform and highly
> non-uniform distributions is too delicate a problem for me to solve
> right now. The logic suggested by Dean Rasheed in [1] always produces
> no MCVs for a perfectly uniform distribution (which is good), but very
> often also for other distributions, which is not good. My efforts to
> tweak that didn't work, so I didn't get as far as adapting it for the
> problem Jeff is trying to solve.
Hmm, Tom suggested that the test based on the average frequency over
all values might be too strict because the estimated number of distinct
values is often too low, so that might explain what you're seeing.

It occurs to me that maybe a better test for excluding a value from the
MCV list would be to demand that its relative standard error not be too
high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.

Regards,
Dean
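P.S. As a rough sketch of what I mean (this is illustrative only, not
the actual ANALYZE code: the normal-approximation formula with finite
population correction and the 20% cutoff are assumptions I'm making
here for the example):

```python
import math

def relative_std_error(count, sample_size, total_rows):
    """Relative standard error of a value's estimated frequency,
    based on its occurrence count in a sample of sample_size rows
    drawn from a table of total_rows rows.

    Uses the normal approximation to the sampling distribution of a
    proportion, with a finite population correction. Illustrative
    only; not the formula used by PostgreSQL's ANALYZE."""
    p = count / sample_size                      # estimated frequency
    fpc = (total_rows - sample_size) / (total_rows - 1)
    se = math.sqrt(p * (1.0 - p) / sample_size * fpc)
    return se / p

def keep_in_mcv(count, sample_size, total_rows, max_rse=0.2):
    # Exclude a value whose frequency estimate is too noisy to trust.
    # The 20% default threshold is an arbitrary choice for this sketch.
    return relative_std_error(count, sample_size, total_rows) <= max_rse
```

So a value seen, say, 200 times in a 30000-row sample would have a
fairly tight estimate and be kept, whereas a value seen only 3 times
would have a relative standard error well over 50% and be excluded,
regardless of whether it happened to clear the average-frequency test.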