# Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

```On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Overall, our formula is inherently conservative of n_distinct.   That is, I
> > believe that it is actually computing the *smallest* number of distinct
> > values which would reasonably produce the given sample, rather than the
> > *median* one.  This is contrary to the notes in analyze.c, which seem to
> > think that we're *overestimating* n_distinct.
>
> Well, the notes are there because the early tests I ran on that formula
> did show it overestimating n_distinct more often than not.  Greg is
> correct that this is inherently a hard problem :-(
>
> I have nothing against adopting a different formula, if you can find
> something with a comparable amount of math behind it ... but I fear
> it'd only shift the failure cases around.
> ```
```
Perhaps the formula is not actually being applied?

The code looks like this...
if (nmultiple == 0)
{
/* If we found no repeated values, assume it's a unique column */
}
else if (toowide_cnt == 0 && nmultiple == ndistinct)
{
/*
* Every value in the sample appeared more than once.  Assume
* the column has just these values.
*/
}
else
{
/*----------
* Estimate the number of distinct values using the estimator
* proposed by Haas and Stokes in IBM Research Report RJ 10025:

The middle chunk of code looks to me like if we find a distribution
where values all occur at least twice, then we won't bother to apply the
Haas and Stokes equation. That type of frequency distribution would be
very common in a set of values with very high ndistinct, especially when
sampled.

The comment
* Every value in the sample appeared more than once.  Assume
* the column has just these values.
doesn't seem to apply when using larger samples, as Josh is using.

Looking at Josh's application it does seem likely that when taking a
sample, all site visitors clicked more than once during their session,