Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

Josh Berkus Wed, 27 Apr 2005 08:30:06 -0700

Mischa,

> >Perhaps I can save you some time (yes, I have a degree in Math). If I
> >understand correctly, you're trying to extrapolate from the correlation
> >between a tiny sample and a larger sample. Introducing the tiny sample
> >into any decision can only produce a less accurate result than just
> >taking the larger sample on its own; GIGO. Whether they are consistent
> >with one another has no relationship to whether the larger sample
> >correlates with the whole population. You can think of the tiny sample
> >like "anecdotal" evidence for wonderdrugs.


Actually, it's more to characterize how large of a sample we need.  For 
example, if we sample 0.005 of disk pages, and get an estimate, and then 
sample another 0.005 of disk pages and get an estimate which is not even 
close to the first estimate, then we have an idea that this is a table which 
defies analysis based on small samples.  Whereas if the two estimates are < 
1.0 stdev apart, we can have good confidence that the table is easily 
estimated.  Note that this doesn't require progressively larger samples; any 
two samples would work.
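
A rough sketch of that two-sample check, in Python.  Everything here is 
illustrative: the function names are made up, the table is modelled as a list 
of pages (each a list of column values), and the estimator is the Haas-Stokes 
Duj1 form n*d / (n - f1 + f1*n/N), which I am assuming matches what ANALYZE 
does.  Since the text above does not say how the stdev would be obtained, the 
sketch substitutes a relative-gap tolerance, which is purely an assumption.

```python
import random
from collections import Counter


def duj1_estimate(sample_values, total_rows):
    """Haas-Stokes Duj1 estimate of n_distinct (assumed estimator)."""
    n = len(sample_values)
    if n == 0:
        return 0.0
    counts = Counter(sample_values)
    d = len(counts)                                   # distinct values seen
    f1 = sum(1 for c in counts.values() if c == 1)    # values seen exactly once
    denom = n - f1 + f1 * n / total_rows
    return min(n * d / denom, float(total_rows))


def estimates_agree(pages, fraction, tolerance=0.25, seed=0):
    """Estimate n_distinct from two disjoint page samples and compare.

    `tolerance` is a hypothetical relative-gap threshold standing in for
    the "< 1.0 stdev apart" test; it is not taken from the original text.
    """
    rng = random.Random(seed)
    total_rows = sum(len(p) for p in pages)
    k = max(1, round(len(pages) * fraction))
    if 2 * k > len(pages):
        raise ValueError("fraction too large for two disjoint samples")
    chosen = rng.sample(range(len(pages)), 2 * k)
    first = [v for i in chosen[:k] for v in pages[i]]
    second = [v for i in chosen[k:] for v in pages[i]]
    e1 = duj1_estimate(first, total_rows)
    e2 = duj1_estimate(second, total_rows)
    biggest = max(e1, e2)
    gap = abs(e1 - e2) / biggest if biggest > 0 else 0.0
    return gap <= tolerance, e1, e2
```

If the two estimates disagree badly, that flags the table as one that needs a 
larger sample or a different method; if they agree, the small sample is 
probably good enough.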

> I'm with Tom though in being very wary of solutions that require even
> one-off whole table scans. Maybe we need an additional per-table
> statistics setting which could specify the sample size, either as an
> absolute number or as a percentage of the table. It certainly seems that
> where D/N ~ 0.3, the estimates on very large tables at least are way way
> out.

Oh, I think there are several other cases where estimates are way out.  
Basically the estimation method we have doesn't work for samples smaller than 
0.10 of the table.

> Or maybe we need to support more than one estimation method.

Yes, actually.   We need 3 different estimation methods:
1 for tables where we can sample a large % of pages (say, >= 0.1)
1 for tables where we sample a small % of pages but are "easily estimated"
1 for tables which are not easily estimated but where we can't afford to 
sample a large % of pages.
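
As a sketch only, the three-way choice above could be expressed like this.  
The enum, the function name and its parameters are hypothetical; the 0.1 
cutoff is the "say, >= 0.1" figure from the list, and how "easily estimated" 
gets decided is left to the caller.

```python
from enum import Enum


class Method(Enum):
    LARGE_SAMPLE = "large sample"
    SMALL_SAMPLE_EASY = "small sample, easily estimated"
    SMALL_SAMPLE_HARD = "small sample, not easily estimated"


# Cutoff taken from the "say, >= 0.1" figure; treat it as a placeholder.
LARGE_SAMPLE_FRACTION = 0.1


def choose_method(affordable_fraction, easily_estimated):
    """Pick an estimation method for one table.

    affordable_fraction: fraction of pages we can afford to sample (0..1).
    easily_estimated: result of some consistency check on small samples.
    """
    if affordable_fraction >= LARGE_SAMPLE_FRACTION:
        return Method.LARGE_SAMPLE
    if easily_estimated:
        return Method.SMALL_SAMPLE_EASY
    return Method.SMALL_SAMPLE_HARD
```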

If we're doing sampling-based estimation, I really don't want people to lose 
sight of the fact that page-based random sampling is much less expensive than 
row-based random sampling.   We should really be focusing on methods which 
are page-based.
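
To make the cost difference concrete, here is a small simulation that counts 
how many pages have to be read to collect the same number of sample rows each 
way.  The function names and the fixed rows-per-page layout are assumptions 
made for illustration; real tables have variable row counts per page.

```python
import random


def pages_touched_by_row_sample(total_rows, rows_per_page, sample_rows, rng):
    """Pick rows uniformly at random; count the distinct pages they live on."""
    rows = rng.sample(range(total_rows), sample_rows)
    return len({r // rows_per_page for r in rows})


def pages_touched_by_page_sample(rows_per_page, sample_rows):
    """Pick whole pages and use every row on them; ceiling division."""
    return -(-sample_rows // rows_per_page)


if __name__ == "__main__":
    rng = random.Random(1)
    by_row = pages_touched_by_row_sample(1_000_000, 100, 3000, rng)
    by_page = pages_touched_by_page_sample(100, 3000)
    print("row-based sample reads", by_row, "pages")
    print("page-based sample reads", by_page, "pages")
```

The catch, of course, is that rows on the same page are not independent, so 
the estimator has to cope with that clustering.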

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

