# Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

```> -----Original Message-----
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 27, 2005 10:25 AM
> To: Andrew Dunstan
> Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> [...]
> Actually, it's more to characterize how large of a sample
> we need.  For example, if we sample 0.005 of disk pages, and
> get an estimate, and then sample another 0.005 of disk pages
> and get an estimate which is not even close to the first
> estimate, then we have an idea that this is a table which
> defies analysis based on small samples.  ```
```

> Wheras if the two estimates are < 1.0 stdev apart, we can
> have good confidence that the table is easily estimated.

I don't buy that.  A negative indication is nothing more than
proof by contradiction.  A positive indication is mathematical
induction over the set, which in this type of context is
logically unsound.  There is no reason to believe that two
small samples with a small difference imply that a table is
easily estimated rather than that you got unlucky in your
samples.

> [...]
> Yes, actually.   We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages
> (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are
> "easily estimated"
> 1 for tables which are not easily estimated by we can't
> afford to sample a large % of pages.

I don't buy that the first and second need to be different
estimation methods.  I think you can use the same block
sample estimator for both, and simply stop sampling at
different points.  If you set the default to be a fixed
number of blocks, you could get a large % of pages on
small tables and a small % of pages on large tables, which
is exactly how you define the first two cases.  However,
I think such a default should also be overridable to a
% of the table or a desired accuracy.

Of course, I would recommend the distinct sample technique
for the third case.

> If we're doing sampling-based estimation, I really don't
> want people to lose sight of the fact that page-based random
> sampling is much less expensive than row-based random
> sampling.   We should really be focusing on methods which
> are page-based.

Of course, that savings comes at the expense of having to
account for factors like clustering within blocks.  So block
sampling is more efficient, but can also be less accurate.
Nonetheless, I agree that of the sampling estimators, block
sampling is the better technique.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129