On 22 January 2018 at 08:07, John Naylor <jcnay...@gmail.com> wrote:
> On 1/21/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
>> It occurs to me that maybe a better test to exclude a value from the
>> MCV list would be to demand that its relative standard error not be
>> too high. Such a test, in addition to the existing tests, might be
>> sufficient to solve the opposite problem of too many values in the
>> MCV list, because the real problem there is including a value after
>> having seen relatively few occurrences of it in the sample, and thus
>> having a wildly inaccurate estimate for it. Setting a bound on the
>> relative standard error would mean that we could have a reasonable
>> degree of confidence in estimates produced from the sample.
>
> If you don't mind, what would the math look like for that?
Using the same syntax as before:

  N = Total rows in table (population size)
  n = Number of rows sampled
  x = Number of times a particular value appears in the sample
  p = x/n = Frequency of the value in the sample

the standard error of the proportion is

  SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

Then the relative standard error (which is usually expressed as a
percentage) is

  RSE = 100 * SE / p

So if we were to demand that the relative standard error was less than,
say, 10%, then the constraint would just be

  SE < 0.1 * p

Note:

* This formula is not valid if x is very small (see what I wrote about
  being able to approximate the distribution of p with a normal
  distribution). So we should also enforce the "rule of thumb" x >= 10.

* The frequency p that we're storing is the count divided by the overall
  sample size. So the values for N and n above should not be the counts
  that exclude NULLs. As far as this logic is concerned, NULLs (and
  too-wide values) are just values not equal to the value under
  consideration. Thus it appears, from just a quick glance at your patch,
  that you're using the wrong values for N and n.

* The RSE is a monotonically decreasing function of p in the range [0,1],
  so an upper bound on the RSE equates to a lower bound on the number of
  occurrences of the value.

This last point gives me a different idea. Instead of applying this test
*in addition to* the existing tests, how about applying it *instead of*
the existing tests? That is, we keep all MCVs that appear sufficiently
often that we can be reasonably confident in the estimates produced from
their sample frequencies, regardless of whether or not they are more
common than average (the average being fairly meaningless for a highly
skewed distribution anyway).

This is still keeping the most common values, but now we'd be saying that
we keep any value that appears sufficiently often in the sample that we
can extrapolate its sample frequency to produce a reasonably accurate
estimate of the population frequency, and discard values for which the
estimate is likely to be inaccurate.

I have not tested this idea at all, but it seems like it might be worth
considering. It has the nice property that the test depends entirely on
how often the value appears, rather than on other previously computed
statistics, such as Ndistinct.
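To make the test concrete, it amounts to something like the following
standalone C sketch (just an illustration -- not taken from, or tested
against, the actual analyze.c code; the function name and arguments here
are invented for this example):

/*
 * Keep a candidate MCV only if its sample frequency can be trusted:
 * it must appear at least 10 times, and the relative standard error of
 * p = x/n must be below 10%.
 *
 * N = total rows in the table, n = overall sample size (including NULLs
 * and too-wide values), x = number of occurrences of the value in the
 * sample.
 */
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

static bool
keep_mcv_candidate(double N, double n, double x)
{
    double  p = x / n;      /* sample frequency of the value */
    double  se;             /* standard error of p, with finite
                             * population correction */

    if (x < 10)             /* normal approximation no longer holds */
        return false;

    se = sqrt(p * (1.0 - p) / n) * sqrt((N - n) / (N - 1.0));

    return se < 0.1 * p;    /* relative standard error below 10% */
}

int
main(void)
{
    /* Boundary cases from the SQL test below (N = 10 million, n = 30000) */
    printf("x = 108: %s\n", keep_mcv_candidate(1e7, 30000, 108) ? "KEEP" : "DISCARD");
    printf("x = 46:  %s\n", keep_mcv_candidate(1e7, 30000, 46) ? "KEEP" : "DISCARD");
    return 0;
}

With N = 10 million and n = 30,000, the 10% RSE bound works out to
requiring roughly x >= 100 occurrences in the sample, which is where the
KEEP/DISCARD boundary falls in the test below.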
Doing a quick test in pure SQL, using the highly skewed distribution Jeff
Janes gave in the other thread, with the default sample size of 30,000:

with samples as (
  select floor(-log(random())/log(2))::int as who
    from generate_series(1,30000)
),
freqs as (
  select who, count(*) as x, count(*)/30000::float8 as p
    from samples
   group by who
),
stats as (
  select *,
         sqrt(p*(1-p)/30000) *
           sqrt((10000000-30000)::float8/(10000000-1)) as se
    from freqs
)
select *,
       (10000000*p)::int||'+/-'||(2*se*10000000)::int as "95% interval",
       case when x >= 10 and se < 0.1*p then 'KEEP' else 'DISCARD' end
  from stats
 order by p desc
 limit 100;

it pretty consistently keeps the 8 most common values:

 who |   x   |          p           |          se          |  95% interval   |  case
-----+-------+----------------------+----------------------+-----------------+---------
   0 | 15017 |    0.500566666666667 |  0.00288241625942075 | 5005667+/-57648 | KEEP
   1 |  7607 |    0.253566666666667 |  0.00250800597590887 | 2535667+/-50160 | KEEP
   2 |  3713 |    0.123766666666667 |     0.00189844800003 | 1237667+/-37969 | KEEP
   3 |  1855 |   0.0618333333333333 |   0.0013884757600711 | 618333+/-27770  | KEEP
   4 |   914 |   0.0304666666666667 | 0.000990788179299791 | 304667+/-19816  | KEEP
   5 |   448 |   0.0149333333333333 | 0.000699194759916533 | 149333+/-13984  | KEEP
   6 |   229 |  0.00763333333333333 | 0.000501741670388358 | 76333+/-10035   | KEEP
   7 |   108 |               0.0036 | 0.000345267009604061 | 36000+/-6905    | KEEP
   8 |    46 |  0.00153333333333333 | 0.000225565173744715 | 15333+/-4511    | DISCARD
   9 |    34 |  0.00113333333333333 | 0.000193963300230354 | 11333+/-3879    | DISCARD
  10 |    17 | 0.000566666666666667 | 0.000137191663419411 | 5667+/-2744     | DISCARD
  11 |    11 | 0.000366666666666667 | 0.000110367969704201 | 3667+/-2207     | DISCARD
  13 |     1 | 3.33333333333333e-05 | 3.32827427148958e-05 | 333+/-666       | DISCARD
(13 rows)

and if you increase the stats target to 1000 (sample size 300,000) then it
pretty consistently keeps the 11 most common values. So, in this very
limited test at least, it seems to help with Jeff's problem, and it's no
longer the case that cranking up the default statistics target has a weak
effect on the number of MCVs kept.

In the case of a uniform distribution, this might end up keeping all the
values seen, but that's not necessarily wrong -- if they are seen often
enough to make the MCV estimates accurate, why not keep them?

Of course this would need a lot more testing with other data
distributions, and I've not given any thought to the existing histogram
bounds test. Also the 10% RSE bound is fairly arbitrary, and could be
tweaked, but perhaps this is not an entirely crazy idea.

Regards,
Dean