Re: MCV lists for highly skewed distributions

Dean Rasheed Sun, 11 Mar 2018 01:24:08 -0800

On 6 March 2018 at 16:48, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
> On 6 March 2018 at 08:51, John Naylor <jcnay...@gmail.com> wrote:
>> On 3/5/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
>>> Attached is an updated patch.
>> Nice. The results look good.
>
> Thanks for the review.
>


So I was about ready to commit this, but decided to do more testing,
because I worry that there may be instances that a user could point to
where it makes estimates worse. Maybe that's inevitable for any change
we might make, and I don't think that should stop us from at least
trying to improve things, but it does make me wary about committing
this without a lot of testing.

In all the testing I've done so far, this looks to be a definite
improvement over the current algorithm, but it looks like it's
possible to do better.

Consider the following test case:

drop table if exists foo;
create temp table foo(a int);
insert into foo
  select * from
   (
     select x/2 from generate_series(0,19999999) g(x)
     union all
     select 0 from generate_series(0,9999999)
     union all
     select 1 from generate_series(0,999999)
     union all
     select 2 from generate_series(0,99999)
     union all
     select 3 from generate_series(0,9999)
     union all
     select 4 from generate_series(0,999)
     union all
     select 5 from generate_series(0,99)
   ) t
   order by random();
alter table foo alter column a set statistics 10000;
analyse foo;

So the table has around 31 million rows, and the stats target is maxed
out -- we're sampling around 10% of the table, and it's not possible
to sample more. Looking at the value a=5, it appears 102 times in the
table, so we can expect it to be seen around 10 times in any sample,
but the minimum count for the new algorithm in this case is 23, so it
won't be included in the MCV list. This leads to it having the same
estimate as all other non-MCV values, which turns out to be rows=5, a
considerable under-estimate.

By comparison, the current algorithm in HEAD does include this value
in the MCV list, so it gets a good estimate. On the other hand, it
generates a complete MCV-list with 10000 entries, most of which are
just random values from the list that appear twice in the table and
also happen to appear twice in the sample. So there are nearly 10000
values that are significantly over-estimated in this case.

So arguably, the new algorithm is still better than the current one
for this data, but the fact that it doesn't include a=5 in the list
bugs me, because the estimate for that single value is consistently
worse. Looking at the distribution of a=5 in the sample, it is a
hypergeometric distribution with N=31111100, n=3000000 and K=102. This
gives a sample mean of 9.8 and a variance of 8.9 (a standard deviation
of 3.0). Also, this is common enough that in fact that distribution
can be reasonably approximated by a normal distribution. The value is
excluded from the MCV list because the spread is deemed too large,
relative to the mean, so the relative error of the MCV-based estimate
would be too high. However, not including it in the MCV list causes it
to have an estimate of 5, which would correspond to a sample mean of
0.5, and the observed sample mean is more than 3 standard deviations
higher than that. So we do actually have enough information to
conclude that the value is almost certainly more common than the
non-MCV selectivity would suggest, and I think that's sufficient
justification for including it in the MCV list.

The crux of this is that the relative-standard-error algorithm is
making use of 2 pieces of information -- the sample frequency, and an
estimate of its statistical spread. However, there is a 3rd piece of
information that we know, that is not being made use of -- the
selectivity that would be assigned to the value if it were not
included in the MCV list. Looking at just the first 2 pieces of
information ensures that we get decent estimates for the values in the
MCV list (where what constitutes "decent" depends on an arbitrarily
chosen RSE cutoff), but it doesn't take into account the fact that the
values not in the list may have much worse estimates. Making use of
the non-MCV-based estimate allows us to do better -- if the MCV-based
estimate is statistically significantly higher than the non-MCV-based
estimate, then it would almost certainly be better to include that
value in the MCV list, even if its spread is quite large.

This is somewhat similar to the initial idea from Jeff Janes, except
that patch didn't take into account the spread, so it ended up making
the too-many-mcvs problem worse for uniform distributions. There's
also another subtlety with that patch -- when comparing frequencies of
values not in the MCV list, you really have to start from a fully
populated MCV list and work down, rather than starting with a empty
one and working up, because if all the common values have about the
same frequency, the most common amongst them may not be much more
common than the initial average, so you may end up with an empty MCV
list.

So attached is an updated patch, based on this new idea. The principle
behind the error estimating is similar, but the code ended up looking
very different.

Repeating the previous batch of tests, the results are broadly
similar, in that it does quite well at preventing too many or too few
MCVs, and it responds well to increasing the stats target. The main
difference is that it tends to produce a few more MCVs for each of
non-uniform distributions, making the estimates for the non-MCV values
better.

Running the test above with a variety of different stats target values
against HEAD, Jeff's original patch (P1), the RSE patch, and this new
v2 patch gives the following MCV-list sizes:

stats_target  HEAD     P1  RSE   v2
   10000     10000  10000    5    6
    8000      8000   8000    5    6
    6000      6000   6000    5    6
    4000      4000   4000    5  5/6
    2000      2000   2000  4/5    5
    1000      ~900   ~850    4    5
     500      ~240   ~250    4  4/5
     250       ~70    ~60  3/4    4
     100       ~20    ~15    3    4

So HEAD and P1 are both producing too many MCVs. The latest 2 patches
cure that problem, but the new v2 patch is better at picking out all
the common values.

I'm moving this back to a status of "Needs review" partly because the
code has changed significantly, but also because I want to do more
testing, particularly with larger datasets.

Regards,
Dean

mcv-stats-v2.patch
Description: Binary data

Re: MCV lists for highly skewed distributions

Reply via email to