On 1/10/19 4:20 PM, Dean Rasheed wrote:
> On Wed, 9 Jan 2019 at 15:40, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>> On 1/8/19 3:18 PM, Dean Rasheed wrote:
>>> So actually, the estimate for a group of values will be either the MCV
>>> item's frequency (if the MCV item is kept), or (roughly) the MCV
>>> item's base_frequency (if the MCV item is not kept). That suggests
>>> that we should simply keep items that are significantly more or less
>>> common than the item's base frequency -- i.e., keep rule (b) and ditch
>>> rule (a).
>>>
>>
>> Hmmm, but won't that interfere with how we extrapolate the MCV
>> estimate to the non-MCV part? Currently the patch does what you
>> proposed, i.e.
>>
>>     other_sel = simple_sel - mcv_basesel;
>>
>> I'm worried that if we only include the items that are significantly
>> more or less common than the base frequency, it may skew the other_sel
>> estimate.
>>
>
> I don't see how that would skew other_sel. Items close to the base
> frequency would also tend to be close to simple_sel, making other_sel
> approximately zero, so excluding them should have little effect.
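To spell out the combination being discussed (a sketch with illustrative
names, only loosely following the patch, not its actual code):

    /*
     * Illustrative sketch, not the patch itself:
     *
     *   mcv_sel      sum of frequencies of MCV items matching the clauses
     *   mcv_basesel  sum of base frequencies of those same items
     *   simple_sel   estimate built from the per-column statistics
     */
    static double
    clamp_probability(double p)
    {
        if (p < 0.0)
            return 0.0;
        if (p > 1.0)
            return 1.0;
        return p;
    }

    static double
    combine_mcv_selectivity(double simple_sel, double mcv_sel,
                            double mcv_basesel)
    {
        /* part of simple_sel not already covered by the MCV list */
        double  other_sel = clamp_probability(simple_sel - mcv_basesel);

        /* MCV part plus the extrapolated non-MCV part */
        return clamp_probability(mcv_sel + other_sel);
    }

An item whose frequency is close to its base frequency adds about as much
to mcv_sel as it subtracts through mcv_basesel, so dropping it barely
moves the total.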
Oh, I see. You're right, those items should contribute very little to
other_sel; I should have realized that.

> However...
>
> Re-reading the thread where we enhanced the per-column MCV stats last
> year [1], it was actually the case that an algorithm based on just
> looking at the relative standard error worked pretty well for a very
> wide range of data distributions.
>
> The final algorithm chosen in analyze_mcv_list() was only a marginal
> improvement on that, and was directly based upon the fact that, in the
> univariate statistics case, all the values not included in the MCV
> list are assigned the same selectivity. However, that's not the case
> for multivariate stats, because each group not included in the
> multivariate MCV list gets assigned a different selectivity based on
> its per-column stats.
>
> So perhaps what we should do for multivariate stats is simply use the
> relative standard error approach (i.e., reuse the patch in [2] with a
> 20% RSE cutoff). That had a lot of testing at the time, against a wide
> range of data distributions, and proved to be very good, not to
> mention being very simple.
>
> That approach would encompass both groups more and less common than
> the base frequency, because it relies entirely on the group appearing
> enough times in the sample to infer that any errors on the resulting
> estimates will be reasonably well controlled. It wouldn't actually
> look at the base frequency at all in deciding which items to keep.
>
> Moreover, if the group appears sufficiently often in the sample to
> justify being kept, each of the individual column values must also
> appear at least that often as well, which means that the errors on the
> base frequency estimate are also well controlled. That was one of my
> concerns about other algorithms such as "keep items significantly more
> or less common than the base frequency" -- in the less common case,
> there's no lower bound on the number of occurrences seen, and so no
> guarantee that the errors are kept under control.
>

Yep, that looks like a great approach. Simple and tested. I'll try
tweaking the patch accordingly over the weekend. Thanks!

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
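PS: For anyone wanting to see what the 20% RSE cutoff amounts to, here is
a rough sketch based on the standard error of a sample proportion with a
finite-population correction (names and layout are mine, not the code
from [2]):

    #include <math.h>
    #include <stdbool.h>

    /*
     * Rough sketch of an RSE-based cutoff (not the actual patch): keep a
     * sampled group only if the relative standard error of its sample
     * proportion is below the cutoff (0.2 for a 20% RSE).
     *
     *   count      - number of sample rows in the group
     *   samplerows - rows in the sample (n), assumed > 0
     *   totalrows  - rows in the table (N)
     *
     *   SE(p) = sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))
     */
    static bool
    group_is_reliable(double count, double samplerows, double totalrows,
                      double rse_cutoff)
    {
        double  p = count / samplerows;
        double  fpc = (totalrows > 1.0) ?
            (totalrows - samplerows) / (totalrows - 1.0) : 0.0;
        double  se = sqrt(p * (1.0 - p) / samplerows * fpc);

        /* se / p < rse_cutoff, rewritten to avoid dividing by zero */
        return (count > 0.0) && (se < rse_cutoff * p);
    }

Because the test only asks whether the group itself shows up often enough
in the sample, it keeps groups both more and less common than their base
frequency, and any group that passes also gives each of its per-column
values at least that many occurrences, so the base frequency estimate is
under control as well.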