# Re: MCV lists for highly skewed distributions

```On 6 March 2018 at 08:51, John Naylor <jcnay...@gmail.com> wrote:
> On 3/5/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
>> Attached is an updated patch.
> Nice. The results look good.```
```
Thanks for the review.

> I agree it should be in a separate function. As for that large
> comment, I spent some time pouring over it to verify the math and make
> sure I understand it. I see a couple points where it might be a bit
> confusing to someone reading this code for the first time:
>
> +        * This bound is at most 25, and approaches 0 as n approaches 0 or N.
> The
> +        * case where n approaches 0 cannot happen in practice, since the
> sample
> +        * size is at least 300.
>
> I think the implication is that the bound cannot dip below 10 (the
> stated minimum to justify using techniques intended for a normal
> distributed sample), so there's no need to code a separate clamp to
> ensure 10 values. That might be worth spelling out explicitly in the
> comment.
>

Perhaps I should really say that n can't be less than 300 unless N is
too, in which case n=N. So the only case where we need to worry about
the bound approaching zero is when n --> N, and we're sampling the
entire table, or almost all of it. I'll see if I can re-word that to
be clearer.

> +        * size is at least 300.  The case where n approaches N corresponds to
> +        * sampling the whole the table, in which case it is reasonable to
> keep
> +        * the whole MCV list (have no lower bound), so it makes sense to
> apply
> +        * this formula for all inputs, even though the above derivation is
> +        * technically only valid when the right hand side is at least around
> 10.
>
> It occurred to me that if the table cardinality is just barely above
> the sample size, we could get a case where a value sampled only once
> could slip into the MCV list. With the default stat target, this would
> be tables with between 30,000 and 31,500 rows.

(actually I think that's 31250 rows, or more generally when we sample
more than around 96% of the table)

> I couldn't induce this
> behavior, so I looked at the logic that identifies MCV candidates, and
> saw a test for
>
> if (dups_cnt > 1)
>
> If I'm reading that block correctly, than a value sampled once will
> never even be considered for the MCV list, so it seems that the
> defacto lower bound for these tables is two. It might be worth
> mentioning that in the comment.
>
> It also means that this clamp on HEAD
>
> -                       if (mincount < 2)
> -                               mincount = 2;
>
>

Yes, in compute_scalar_stats(), each value is guaranteed to have been
seen at least twice. However, looking at compute_distinct_stats(),
that's not the case. I'm not sure if that matters.

When we have sampled 96% of the table, any estimate we produce based
on the number of times a value has been seen in the sample is likely
to be almost exact, even if it has only been seen once or twice. So
the estimates from the MCV list will all be good, but I suspect in
this case the estimates for all other values that didn't fit in the
MCV list will also be good, so it may not matter whether or not we
keep those MCV items. My instinct is to keep them, on the grounds that

>>   mcv_cutoff - The relative standard error cutoff percentage used (10,
>> 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests
>
> I'm not quite following the negative numbers for HEAD. They're all the
> equivalent, right? Just a label for convenience to make sure you ran
> the same number of tests?
>

Yes, they should all be equivalent. They just reflect the way I ran
the tests -- HEAD vs the patch with a cutoff of 10%, HEAD vs the patch
with a cutoff of 15%, and so on. So I ended up running it against HEAD
9 times, and I didn't want to throw any of that data away. It's useful
for getting a feel for the scope of random variations between test
runs.

Regards,
Dean

```