On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <hjp-pg...@hjp.at> wrote:

> Yes, estimating the number of distinct values from a relatively small
> sample is hard when you don't know the underlying distribution. It might
> be possible to analyze the sample to find the distribution and get a
> better estimate. But I'm not sure how useful that would really be: If
> a few values are very common and most very rare you are probably also
> much more likely to use the common values in a query: And for those you
> would massively underestimate their frequency if you had an accurate
> n_distinct value. That might be just as bad or even worse.



This would only be true for values that are "common" but not in the MCVs
list, right?
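
(As a concrete illustration, using a hypothetical table "orders" with a
column "customer_id", you can see which values actually made the MCV list
and the current ndistinct estimate with something like:

  SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
  WHERE tablename = 'orders' AND attname = 'customer_id';

A "common" value that doesn't appear in most_common_vals only gets the
generic selectivity derived from n_distinct, which is where the bad
estimate shows up.)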

If we could increase the sampling ratio beyond the hard-coded 300x factor
to get a more representative sample, and use that sample to estimate
ndistinct (and also the frequencies of the most common values), but still
only store the 100 MCVs (or whatever the stats target is set to for the
system or column), then the issue might be mitigated without increasing
planning time from stats that are larger than prudent. The "only" cost
would be longer processing time when (auto)analyzing... plus the overhead
of considering this potential new setting in all analyze cases, I suppose.
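
(For comparison, the existing per-column knobs, sketched here with the same
hypothetical "orders"/"customer_id" names, are:

  -- larger sample, but also larger MCV/histogram lists
  -- (ANALYZE samples 300 * statistics_target rows)
  ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;

  -- or override the estimate directly, if you already know it
  ALTER TABLE orders ALTER COLUMN customer_id SET (n_distinct = 50000);

  ANALYZE orders;

The first one ties the sample size to the stored stats size, which is
exactly the coupling the suggestion above would break; the second requires
knowing the right value up front.)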
