> I looked into this a while back when we were talking about changing the
> sampling method. The conclusions were discouraging. Fundamentally, using
> constant sized samples of data for n_distinct is bogus. Constant sized
> samples only work for things like the histograms that can be analyzed
> through standard statistics population sampling which depends on the law of
> large numbers.

Well, unusual distributions are certainly tough.  But I think the problem 
exists even for relatively well-distributed populations.    Part of it is, I 
believe, the formula we are using:

n*d / (n - f1 + f1*n/N)

This is an estimation formula from Haas and Stokes in IBM Research Report RJ 
10025, and is called the DUJ1 formula. 
(  It appears to 
suck.   For example, take my table:

rows: 26million (N)
distinct values: 3.4million

I took a random sample of 1000 rows (n) from that table.   It contained:
968 values that occurred only once (f1)
981 distinct values (d)

Any human being looking at that sample would assume a large number of distinct 
values; after all, 96.8% of the values occurred only once.   But the formula 
gives us:

1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620

This is obviously dramatically wrong, by a factor of 100.  The math gets worse 
as the sample size goes down:

Sample 250, 248 distinct values, 246 unique values:

250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490

Even in a case with an ovewhelming majority of unique values, the formula gets 
it wrong:

999 unque values in sample
998 appearing only once:

1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093

This means that, with a sample size of 1000 a table of 26million rows cannot 
ever have with this formula more than half a million distinct values, unless 
the column is a unique column.

Overall, our formula is inherently conservative of n_distinct.   That is, I 
believe that it is actually computing the *smallest* number of distinct 
values which would reasonably produce the given sample, rather than the 
*median* one.  This is contrary to the notes in analyze.c, which seem to 
think that we're *overestimating* n_distinct.  

This formula appears broken everywhere:

Table: 969000 rows
Distinct values: 374000
Sample Size: 1000
Unique values in sample: 938
Values appearing only once: 918

1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308

Again, too small by a factor of 20x.   

This is so broken, in fact, that I'm wondering if we've read the paper right?  
I've perused the paper on almaden, and the DUJ1 formula appears considerably 
more complex than the formula we're using.  

Can someone whose math is more recent than calculus in 1989 take a look at 
that paper, and look at the formula toward the bottom of page 10, and see if 
we are correctly interpreting it?    I'm particularly confused as to what "q" 
and "d-sub-n" represent.  Thanks!


Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Reply via email to