# Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

```Greg,

> I looked into this a while back when we were talking about changing the
> sampling method. The conclusions were discouraging. Fundamentally, using
> constant sized samples of data for n_distinct is bogus. Constant sized
> samples only work for things like the histograms that can be analyzed
> through standard statistics population sampling which depends on the law of
> large numbers.```
```
Well, unusual distributions are certainly tough.  But I think the problem
exists even for relatively well-distributed populations.    Part of it is, I
believe, the formula we are using:

n*d / (n - f1 + f1*n/N)

This is an estimation formula from Haas and Stokes in IBM Research Report RJ
10025, and is called the DUJ1 formula.
suck.   For example, take my table:

rows: 26million (N)
distinct values: 3.4million

I took a random sample of 1000 rows (n) from that table.   It contained:
968 values that occurred only once (f1)
981 distinct values (d)

Any human being looking at that sample would assume a large number of distinct
values; after all, 96.8% of the values occurred only once.   But the formula
gives us:

1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620

This is obviously dramatically wrong, by a factor of 100.  The math gets worse
as the sample size goes down:

Sample 250, 248 distinct values, 246 unique values:

250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490

Even in a case with an ovewhelming majority of unique values, the formula gets
it wrong:

999 unque values in sample
998 appearing only once:

1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093

This means that, with a sample size of 1000 a table of 26million rows cannot
ever have with this formula more than half a million distinct values, unless
the column is a unique column.

Overall, our formula is inherently conservative of n_distinct.   That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one.  This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.

This formula appears broken everywhere:

Table: 969000 rows
Distinct values: 374000
Sample Size: 1000
Unique values in sample: 938
Values appearing only once: 918

1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308

Again, too small by a factor of 20x.

This is so broken, in fact, that I'm wondering if we've read the paper right?
I've perused the paper on almaden, and the DUJ1 formula appears considerably
more complex than the formula we're using.

Can someone whose math is more recent than calculus in 1989 take a look at
that paper, and look at the formula toward the bottom of page 10, and see if
we are correctly interpreting it?    I'm particularly confused as to what "q"
and "d-sub-n" represent.  Thanks!

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco