Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

Dean Rasheed Sun, 11 Jun 2017 09:11:18 -0700

On 05/06/17 09:30, Tom Lane wrote:
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value"...
>
> Second, the code also has a rule that potential MCVs need to have an
> estimated frequency at least 25% larger than what it thinks the "average"
> value's frequency is.  A rule of that general form seems like a good idea,
> but I now think the 25% threshold is far too small to do anything useful.
> In particular, in any case like this where there are more distinct values
> than there are sample rows, the "average frequency" estimate will
> correspond to less than one occurrence in the sample, so that this rule is
> totally useless to filter anything that we would otherwise consider as an
> MCV.  I wonder if we shouldn't make it be "at least double the estimated
> average frequency".
>


I think we should attempt to come up with a more principled approach
to this, taking into account the table and sample sizes. Here's what I
found, after a bit of research:

A common initial rule of thumb is that the value should occur at least
10 times in the sample - see, for example [1], [2]. The idea behind
that goes as follows:

Suppose that N is the population size (total number of rows in the
table), and n is the sample size, and that some particular candidate
value appears x times in the sample.

Then the "sample proportion" is given by

  p = x/n.

Taking a different sample would, in general, give a different value
for x, and hence for the sample proportion, so repeatedly taking
different random samples of the table would lead to a distribution of
values of p, and in general, this distribution would be a binomial
distribution, since x is the count of sampled rows matching a yes/no
question (does the row match the candidate value?).

The idea behind the rule of thumb above is that if n is large enough,
and x is not too close to 0 or n, then the binomial distribution of p
can be reasonably approximated by a normal distribution. There are
various rules of thumb for this to be true, but this one is actually
one of the stronger ones, as can be seen in [2]. Technically the rule
should be:

  x >= 10 and x <= n-10

but it's probably not worth bothering with the second test, since
we're looking for MCVs.

Note that this says nothing about the margin of error of p, just that
it is reasonable to treat it as having a normal distribution, which
then allows the margin of error to be analysed using standard
techniques.

The standard way of doing this is to calculate the "standard error" of
the sample proportion - see, for example [3], [4]:

  SE = sqrt(p*(1-p)/n)

Note, however, that this formula assumes that the sample size n is
small compared to the population size N, which is not necessarily the
case. This can be taken into account by applying the "finite
population correction" (see, for example [5]), which involves
multiplying by an additional factor:

  SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

This correction factor becomes 0 when n=N (whole table sampled => no
error) and it approaches 1 when n is much smaller than N.

Having computed the standard error of the sample proportion, it is
then possible to establish a confidence interval around p. For
example, one could say that there is a 95% probability that the total
population proportion lies in the range

  [ pmin = p-2*SE, pmax = p+2*SE ]

This gives rise to the possibility of writing another rule for candidate MCVs:

If there are Nd distinct values in the table, so that the average
frequency of occurrence of any particular value is 1/Nd, then the test

  pmin > 1/Nd

would imply that there is a 97.5% probably that the candidate value is
more common than the average (since the 5% of the distribution of p
outside the confidence interval is split evenly below pmin and above
pmax).

To take a concrete example, suppose that the table has N=1000,000
rows, with Nd=500 distinct values, so that each value occurs on
average 2000 times. If the sample size is 30,000 then the current rule
that a value needs to have an estimated frequency 25% over the average
means that a value would need to be seen 75 times to be considered as
an MCV. If that rule were changed to "at least double the estimated
average frequency", then a value would need to be seen at least 150
times. On the other hand, using the above rule based on a 95%
confidence interval, a value would only need to be seen 78 times, in
which case the estimated total number of occurrences of the value in
the table would be 2600+/-579, and using the MCV estimate would almost
certainly be better than not using it.

Regards,
Dean


[1] https://onlinecourses.science.psu.edu/stat200/node/43
[2] https://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation
[3] http://mathworld.wolfram.com/SampleProportion.html
[4] https://en.wikipedia.org/wiki/Population_proportion
[5] 
http://stattrek.com/statistics/dictionary.aspx?Definition=finite_population_correction
[6] https://en.wikipedia.org/wiki/Margin_of_error


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

Reply via email to