On 05/06/17 09:30, Tom Lane wrote:
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value"...
>
> Second, the code also has a rule that potential MCVs need to have an
> estimated frequency at least 25% larger than what it thinks the "average"
> value's frequency is. A rule of that general form seems like a good idea,
> but I now think the 25% threshold is far too small to do anything useful.
> In particular, in any case like this where there are more distinct values
> than there are sample rows, the "average frequency" estimate will
> correspond to less than one occurrence in the sample, so that this rule is
> totally useless to filter anything that we would otherwise consider as an
> MCV. I wonder if we shouldn't make it be "at least double the estimated
> average frequency".
>

I think we should attempt to come up with a more principled approach to this, taking into account the table and sample sizes. Here's what I found, after a bit of research:

A common initial rule of thumb is that the value should occur at least 10 times in the sample - see, for example, [1], [2]. The idea behind that goes as follows:

Suppose that N is the population size (the total number of rows in the table), n is the sample size, and that some particular candidate value appears x times in the sample. Then the "sample proportion" is given by

    p = x/n

Taking a different sample would, in general, give a different value for x, and hence for the sample proportion, so repeatedly taking different random samples of the table would lead to a distribution of values of p. In general, this distribution would be a binomial distribution, since x is the count of sampled rows matching a yes/no question (does the row match the candidate value?).

The idea behind the rule of thumb above is that if n is large enough, and x is not too close to 0 or n, then the binomial distribution of p can be reasonably approximated by a normal distribution. There are various rules of thumb for when this is true, but this one is actually one of the stronger ones, as can be seen in [2]. Technically the rule should be

    x >= 10 and x <= n-10

but it's probably not worth bothering with the second test, since we're looking for MCVs.

Note that this says nothing about the margin of error of p, just that it is reasonable to treat p as having a normal distribution, which then allows the margin of error to be analysed using standard techniques. The standard way of doing this is to calculate the "standard error" of the sample proportion - see, for example, [3], [4]:

    SE = sqrt(p*(1-p)/n)

Note, however, that this formula assumes that the sample size n is small compared to the population size N, which is not necessarily the case.
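To make those quantities concrete, here is a short Python sketch (purely illustrative - ANALYZE itself is C code, and the function names here are mine, not from the PostgreSQL source) computing the sample proportion, its standard error, and the normal-approximation rule of thumb:

```python
import math

# Illustrative sketch only; these function names are not from
# the PostgreSQL source.

def normal_approx_ok(x, n):
    """Rule of thumb for the normal approximation to the binomial:
    the sample count x should not be too close to 0 or n."""
    return x >= 10 and x <= n - 10

def sample_proportion_se(x, n):
    """Sample proportion p = x/n and its standard error, assuming the
    sample size n is small relative to the table size N (no finite
    population correction yet)."""
    p = x / n
    return p, math.sqrt(p * (1 - p) / n)

# A value seen 78 times in a 30,000-row sample:
x, n = 78, 30_000
p, se = sample_proportion_se(x, n)
print(normal_approx_ok(x, n))  # True: safe to treat p as ~normal
print(p, se)                   # p = 0.0026, SE roughly 0.000294
```

Note that with x >= 10 comfortably satisfied, treating p as normally distributed is reasonable, which is what justifies the confidence-interval analysis below.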
This can be taken into account by applying the "finite population correction" (see, for example, [5]), which involves multiplying by an additional factor:

    SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

This correction factor becomes 0 when n=N (the whole table is sampled, so there is no error), and it approaches 1 when n is much smaller than N.

Having computed the standard error of the sample proportion, it is then possible to establish a confidence interval around p. For example, one could say that there is a 95% probability that the total population proportion lies in the range

    [ pmin = p-2*SE, pmax = p+2*SE ]

This gives rise to the possibility of writing another rule for candidate MCVs: if there are Nd distinct values in the table, so that the average frequency of occurrence of any particular value is 1/Nd, then the test

    pmin > 1/Nd

would imply that there is a 97.5% probability that the candidate value is more common than the average (since the 5% of the distribution of p outside the confidence interval is split evenly below pmin and above pmax).

To take a concrete example, suppose that the table has N=1,000,000 rows, with Nd=500 distinct values, so that each value occurs on average 2,000 times. If the sample size is 30,000, then the current rule that a value needs to have an estimated frequency 25% over the average means that a value would need to be seen 75 times to be considered as an MCV. If that rule were changed to "at least double the estimated average frequency", then a value would need to be seen at least 120 times. On the other hand, using the above rule based on a 95% confidence interval, a value would only need to be seen 78 times, in which case the estimated total number of occurrences of the value in the table would be 2600+/-579, and using the MCV estimate would almost certainly be better than not using it.
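As a sanity check on the numbers in that example, here is a small Python sketch (again illustrative only; the names are mine, not from the PostgreSQL source) that applies the finite population correction and searches for the smallest sample count passing the pmin > 1/Nd test:

```python
import math

def se_with_fpc(x, n, N):
    """Standard error of p = x/n, with the finite population correction."""
    p = x / n
    return math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

def passes_mcv_test(x, n, N, Nd):
    """True if pmin = p - 2*SE exceeds the average frequency 1/Nd."""
    p = x / n
    return p - 2 * se_with_fpc(x, n, N) > 1 / Nd

N, Nd, n = 1_000_000, 500, 30_000

# Thresholds under the existing rules: the average count per value in
# the sample is n/Nd = 60, so 25% over the average is 75, and double
# the average is 120.
avg_count = n / Nd
print(math.ceil(1.25 * avg_count), math.ceil(2 * avg_count))  # 75 120

# Smallest count that passes the proposed confidence-interval test:
x = 1
while not passes_mcv_test(x, n, N, Nd):
    x += 1
print(x)  # 78

# Estimated total occurrences in the table, and the 95% margin of error:
p = x / n
print(round(p * N), round(2 * se_with_fpc(x, n, N) * N))  # 2600 579
```

So the proposed rule admits a value seen 78 times, with an estimated table-wide count of 2600+/-579, whereas the doubling rule would demand 120 occurrences.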
Regards,
Dean

[1] https://onlinecourses.science.psu.edu/stat200/node/43
[2] https://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation
[3] http://mathworld.wolfram.com/SampleProportion.html
[4] https://en.wikipedia.org/wiki/Population_proportion
[5] http://stattrek.com/statistics/dictionary.aspx?Definition=finite_population_correction
[6] https://en.wikipedia.org/wiki/Margin_of_error

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers