Josh Berkus <j...@agliodbs.com> writes:
> A while back I did a fair bit of reading on ndistinct and large tables
> from the academic literature.  The consensus of many papers was that it
> took a sample of at least 3% (or 5% for block-based) of the table in
> order to have 95% confidence in ndistinct of 3X.  I can't imagine that
> MCV is easier than this.

You've got that entirely backwards.  ndistinct is hard because there
could be a whole lot of values with very small occurrence counts,
and sampling is not going to help you distinguish between things
that occur but once and things that occur only two or three times.
But whether the table has a lot of the first or a lot of the second
means a difference of 2X or 3X on ndistinct.

MCVs, on the other hand, are MCVs precisely because they occur a lot,
and so they are highly likely to show up in a sample.  You *can* get a
decent estimate of the first few MCVs from a sample, you just have to be
cautious about not believing something is an MCV if it's only a small
proportion of your sample.

                        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to