On 3/5/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > Attached is an updated patch. I decided to move the code that > determines the minimum number of occurrences for an MCV list item out > to a separate function. It's not much code, but there's a large > comment to explain its statistical justification, and I didn't want to > duplicate that in 2 different places (possibly to become 3, with the > multivariate MCV stats patch).
Nice. The results look good. I agree it should be in a separate function. As for that large comment, I spent some time pouring over it to verify the math and make sure I understand it. I see a couple points where it might be a bit confusing to someone reading this code for the first time: + * This bound is at most 25, and approaches 0 as n approaches 0 or N. The + * case where n approaches 0 cannot happen in practice, since the sample + * size is at least 300. I think the implication is that the bound cannot dip below 10 (the stated minimum to justify using techniques intended for a normal distributed sample), so there's no need to code a separate clamp to ensure 10 values. That might be worth spelling out explicitly in the comment. + * size is at least 300. The case where n approaches N corresponds to + * sampling the whole the table, in which case it is reasonable to keep + * the whole MCV list (have no lower bound), so it makes sense to apply + * this formula for all inputs, even though the above derivation is + * technically only valid when the right hand side is at least around 10. It occurred to me that if the table cardinality is just barely above the sample size, we could get a case where a value sampled only once could slip into the MCV list. With the default stat target, this would be tables with between 30,000 and 31,500 rows. I couldn't induce this behavior, so I looked at the logic that identifies MCV candidates, and saw a test for if (dups_cnt > 1) If I'm reading that block correctly, than a value sampled once will never even be considered for the MCV list, so it seems that the defacto lower bound for these tables is two. It might be worth mentioning that in the comment. It also means that this clamp on HEAD - if (mincount < 2) - mincount = 2; is dead code. > I think this is ready for commit now, so barring objections, I will do > so in the next day or so. Fine by me. I've skimmed your test methodology and results and have just one question below. > There's a lot of random variation between tests, but by running a lot > of them, the general pattern can be seen. I ran the test 9 times, with > different MCV cutoff values in the patched code of 10%, 15%, ... 50%, > then collected all the results from the test runs into a single CSV > file to analyse using SQL. The columns in the CSV are: > > test_name - A textual description of the data distribution used in the > test. > > mcv_cutoff - The relative standard error cutoff percentage used (10, > 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests > against HEAD. I'm not quite following the negative numbers for HEAD. They're all the equivalent, right? Just a label for convenience to make sure you ran the same number of tests? -John Naylor