Re: MCV lists for highly skewed distributions

```On 3/5/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
> Attached is an updated patch. I decided to move the code that
> determines the minimum number of occurrences for an MCV list item out
> to a separate function. It's not much code, but there's a large
> comment to explain its statistical justification, and I didn't want to
> duplicate that in 2 different places (possibly to become 3, with the
> multivariate MCV stats patch).```
```
Nice. The results look good.

I agree it should be in a separate function. As for that large
comment, I spent some time pouring over it to verify the math and make
sure I understand it. I see a couple points where it might be a bit
confusing to someone reading this code for the first time:

+        * This bound is at most 25, and approaches 0 as n approaches 0 or N.
The
+        * case where n approaches 0 cannot happen in practice, since the sample
+        * size is at least 300.

I think the implication is that the bound cannot dip below 10 (the
stated minimum to justify using techniques intended for a normal
distributed sample), so there's no need to code a separate clamp to
ensure 10 values. That might be worth spelling out explicitly in the
comment.

+        * size is at least 300.  The case where n approaches N corresponds to
+        * sampling the whole the table, in which case it is reasonable to keep
+        * the whole MCV list (have no lower bound), so it makes sense to apply
+        * this formula for all inputs, even though the above derivation is
+        * technically only valid when the right hand side is at least around
10.

It occurred to me that if the table cardinality is just barely above
the sample size, we could get a case where a value sampled only once
could slip into the MCV list. With the default stat target, this would
be tables with between 30,000 and 31,500 rows. I couldn't induce this
behavior, so I looked at the logic that identifies MCV candidates, and
saw a test for

if (dups_cnt > 1)

If I'm reading that block correctly, than a value sampled once will
never even be considered for the MCV list, so it seems that the
defacto lower bound for these tables is two. It might be worth
mentioning that in the comment.

It also means that this clamp on HEAD

-                       if (mincount < 2)
-                               mincount = 2;

> I think this is ready for commit now, so barring objections, I will do
> so in the next day or so.

Fine by me. I've skimmed your test methodology and results and have
just one question below.

> There's a lot of random variation between tests, but by running a lot
> of them, the general pattern can be seen. I ran the test 9 times, with
> different MCV cutoff values in the patched code of 10%, 15%, ... 50%,
> then collected all the results from the test runs into a single CSV
> file to analyse using SQL. The columns in the CSV are:
>
>   test_name - A textual description of the data distribution used in the
> test.
>
>   mcv_cutoff - The relative standard error cutoff percentage used (10,
> 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests