# Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

```Mischa,

> >Perhaps I can save you some time (yes, I have a degree in Math). If I
> >understand correctly, you're trying extrapolate from the correlation
> >between a tiny sample and a larger sample. Introducing the tiny sample
> >into any decision can only produce a less accurate result than just
> >taking the larger sample on its own; GIGO. Whether they are consistent
> >with one another has no relationship to whether the larger sample
> >correlates with the whole population. You can think of the tiny sample
> >like "anecdotal" evidence for wonderdrugs.```
```
Actually, it's more to characterize how large of a sample we need.  For
example, if we sample 0.005 of disk pages, and get an estimate, and then
sample another 0.005 of disk pages and get an estimate which is not even
close to the first estimate, then we have an idea that this is a table which
defies analysis based on small samples.   Wheras if the two estimates are <
1.0 stdev apart, we can have good confidence that the table is easily
estimated.  Note that this doesn't require progressively larger samples; any
two samples would work.

> I'm with Tom though in being very wary of solutions that require even
> one-off whole table scans. Maybe we need an additional per-table
> statistics setting which could specify the sample size, either as an
> absolute number or as a percentage of the table. It certainly seems that
> where D/N ~ 0.3, the estimates on very large tables at least are way way
> out.

Oh, I think there are several other cases where estimates are way out.
Basically the estimation method we have doesn't work for samples smaller than
0.10.

> Or maybe we need to support more than one estimation method.

Yes, actually.   We need 3 different estimation methods:
1 for tables where we can sample a large % of pages (say, >= 0.1)
1 for tables where we sample a small % of pages but are "easily estimated"
1 for tables which are not easily estimated by we can't afford to sample a
large % of pages.

If we're doing sampling-based estimation, I really don't want people to lose
sight of the fact that page-based random sampling is much less expensive than
row-based random sampling.   We should really be focusing on methods which
are page-based.

--
Josh Berkus
Aglio Database Solutions
San Francisco