On Wed, Oct 20, 2010 at 6:41 PM, Josh Berkus <j...@agliodbs.com> wrote:
> A while back I did a fair bit of reading on ndistinct and large tables
> from the academic literature. The consensus of many papers was that it
> took a sample of at least 3% (or 5% for block-based) of the table in
> order to have 95% confidence in ndistinct of 3X. I can't imagine that
> MCV is easier than this.
Interestingly, I also read up on this but found different and even more
pessimistic conclusions. Basically, unless you're willing to read about
50% or more of the table you can't make useful estimates at all, and
even then the estimates are pretty unreliable. That makes a lot of
sense, since a handful of entries can easily change ndistinct
completely.

> histogram size != sample size. It is in our code, but that's a bug and
> not a feature.

For the histogram there's a solid statistical reason why the two are
related. For ndistinct I agree you would need to sample a proportion of
the table, and from what I read you really want that proportion to be
100%.

For the MCV list I'm not entirely clear yet what the right answer is.
It's possible you're right, but then I don't see a good algorithm for
calculating MCVs accurately for large sample sizes using a reasonable
amount of resources.

--
greg
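To put rough numbers behind the point that a small sample can't pin down
ndistinct, here is a minimal simulation sketch in Python. It is purely
illustrative and is not PostgreSQL's estimator: the two table shapes, the
3% sample fraction, and the two toy guesses below are all assumptions. It
builds two columns whose 3% samples contain a similar number of distinct
values even though their true ndistinct differs by an order of magnitude,
so a guess that is roughly right for one is badly wrong for the other.

import random

random.seed(1)

N = 1_000_000          # rows in the toy "table"
HEAVY = 50             # a few values cover 80% of the rows

def build(rare_values, copies):
    # 800k rows spread over the HEAVY values, plus 200k rows spread over
    # `rare_values` distinct rare values with `copies` copies each.
    col = [random.randrange(HEAVY) for _ in range(N - rare_values * copies)]
    for v in range(rare_values):
        col += [HEAVY + v] * copies
    return col

tables = {
    "A (each rare value appears once)": build(200_000, 1),  # ndistinct ~200,050
    "B (each rare value appears 10x)":  build(20_000, 10),  # ndistinct  ~20,050
}

n = int(N * 0.03)      # a "generous" 3% row sample
for name, col in tables.items():
    true_d = len(set(col))
    seen = len(set(random.sample(col, n)))
    # Two naive guesses: "what we saw is all there is" vs. linear scale-up.
    print(f"{name}: true={true_d}, seen in sample={seen}, "
          f"scaled-up guess={seen * N // n}")

The point is just that a sample of this size can't tell the two cases
apart without also looking at how often the sampled values repeat, which
is exactly where the estimators in the literature get their wide error
bars.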