I've looked over your patch briefly, but not tested it yet.
> [construction of a pathological data set]
> So the table has around 31 million rows, and the stats target is maxed
> out -- we're sampling around 10% of the table, and it's not possible
> to sample more. Looking at the value a=5, it appears 102 times in the
> table, so we can expect it to be seen around 10 times in any sample,
> but the minimum count for the new algorithm in this case is 23, so it
> won't be included in the MCV list. This leads to it having the same
> estimate as all other non-MCV values, which turns out to be rows=5, a
> considerable under-estimate.
> So arguably, the new algorithm is still better than the current one
> for this data, but the fact that it doesn't include a=5 in the list
> bugs me, because the estimate for that single value is consistently
> worse. Looking at the distribution of a=5 in the sample, it is a
> hypergeometric distribution with N=31111100, n=3000000 and K=102. This
> gives a sample mean of 9.8 and a variance of 8.9 (a standard deviation
> of 3.0).
Mean = n * K / N = 9.8
Var = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 8.9
Stdev = sqrt(Var) = 3.0
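For the archives, those figures are easy to verify directly. A quick Python sanity check using the quoted N, n and K:

```python
import math

# Hypergeometric distribution of a value's count in the sample,
# using the numbers from the example above.
N = 31111100  # table rows
n = 3000000   # sample rows
K = 102       # occurrences of a=5 in the table

mean = n * K / N
var = (K / N) * (1 - K / N) * n * (N - n) / (N - 1)
stdev = math.sqrt(var)

print(round(mean, 1), round(var, 1), round(stdev, 1))  # 9.8 8.9 3.0
```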
> Also, this is common enough that in fact that distribution
> can be reasonably approximated by a normal distribution.
For the archives: that's because it's typically seen around 10 times in
the sample, per the rule of thumb mentioned upthread?
> The value is
> excluded from the MCV list because the spread is deemed too large,
> relative to the mean, so the relative error of the MCV-based estimate
> would be too high. However, not including it in the MCV list causes it
> to have an estimate of 5, which would correspond to a sample mean of
> 0.5, and the observed sample mean is more than 3 standard deviations
> higher than that. So we do actually have enough information to
> conclude that the value is almost certainly more common than the
> non-MCV selectivity would suggest, and I think that's sufficient
> justification for including it in the MCV list.
> The crux of this is that the relative-standard-error algorithm is
> making use of 2 pieces of information -- the sample frequency, and an
> estimate of its statistical spread. However, there is a 3rd piece of
> information that we know, that is not being made use of -- the
> selectivity that would be assigned to the value if it were not
> included in the MCV list. Looking at just the first 2 pieces of
> information ensures that we get decent estimates for the values in the
> MCV list (where what constitutes "decent" depends on an arbitrarily
> chosen RSE cutoff), but it doesn't take into account the fact that the
> values not in the list may have much worse estimates. Making use of
> the non-MCV-based estimate allows us to do better -- if the MCV-based
> estimate is statistically significantly higher than the non-MCV-based
> estimate, then it would almost certainly be better to include that
> value in the MCV list, even if its spread is quite large.
Because we can treat the sample as normal, testing for > 2 standard
deviations means that we're "~95% sure" it's more common in the table
than the non-MCV selectivity would suggest, right? (I realize I'm not
using technically accurate language)
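To make sure I follow, here's how I'd sketch that decision rule in Python. The function name and structure are mine, not the patch's -- just my reading of the "2 standard deviations above the non-MCV prediction" test:

```python
import math

# Hypothetical sketch (names are mine, not the patch's): include a value
# in the MCV list if its sample count is significantly higher than the
# count the non-MCV selectivity would predict.
def should_include_in_mcv(sample_count, N, n, non_mcv_sample_count):
    # Estimate the value's population count from its sample frequency.
    K = sample_count * N / n
    p = K / N
    # Hypergeometric variance of the value's count in a sample of size n.
    var = p * (1 - p) * n * (N - n) / (N - 1)
    stdev = math.sqrt(var)
    # "~95% sure": observed count more than 2 stdevs above the prediction.
    return sample_count - 2 * stdev > non_mcv_sample_count

# The a=5 example: the non-MCV estimate of rows=5 corresponds to
# roughly 0.5 expected occurrences in a 3M-row sample of a 31M-row table.
print(should_include_in_mcv(10, 31111100, 3000000, 0.5))  # True
```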
In your v1 patch, we didn't need a clamp of 10 occurrences in the sample
to treat the distribution as normal, because it was mathematically
impossible to pass the RSE test with fewer than 10 unless we were
sampling most of the table, in which case it didn't matter. Is there a
similar justification with the new algorithm?
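Incidentally, for a rare value the RSE works out to roughly sqrt((1 - n/N) / mean), so an RSE cutoff directly implies a minimum sample count of (1 - n/N) / cutoff^2. If I assume a cutoff of 20% (my guess -- the actual cutoff may differ), that reproduces the minimum count of 23 quoted above for this example:

```python
import math

N = 31111100   # table rows
n = 3000000    # sample rows
cutoff = 0.2   # assumed RSE cutoff -- my guess, not confirmed by the patch

# For a rare value, RSE ~= sqrt((1 - n/N) / mean), so passing
# RSE <= cutoff requires mean >= (1 - n/N) / cutoff^2.
f = n / N
min_count = math.ceil((1 - f) / cutoff ** 2)
print(min_count)  # 23
```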
> [results summary]
> So HEAD and P1 are both producing too many MCVs. The latest 2 patches
> cure that problem, but the new v2 patch is better at picking out all
> the common values.
Looks good. I'll run some tests of my own this week, trying to find
corner cases different from yours.
As far as the code goes, I haven't studied it very closely yet, but I
did want to call attention to this comment:
+ * If the value is kept in the MCV list, its population
+ * assumed to equal its sample frequency, and the distribution
+ * value's count in the sample is a hypergeomtric distribution
+ * the following standard deviation.
The part after "and" doesn't seem to follow from the first part. Would
you mind clarifying?