Greg, > I looked into this a while back when we were talking about changing the > sampling method. The conclusions were discouraging. Fundamentally, using > constant sized samples of data for n_distinct is bogus. Constant sized > samples only work for things like the histograms that can be analyzed > through standard statistics population sampling which depends on the law of > large numbers.

## Advertising

Well, unusual distributions are certainly tough. But I think the problem exists even for relatively well-distributed populations. Part of it is, I believe, the formula we are using: n*d / (n - f1 + f1*n/N) This is an estimation formula from Haas and Stokes in IBM Research Report RJ 10025, and is called the DUJ1 formula. (http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf) It appears to suck. For example, take my table: rows: 26million (N) distinct values: 3.4million I took a random sample of 1000 rows (n) from that table. It contained: 968 values that occurred only once (f1) 981 distinct values (d) Any human being looking at that sample would assume a large number of distinct values; after all, 96.8% of the values occurred only once. But the formula gives us: 1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620 This is obviously dramatically wrong, by a factor of 100. The math gets worse as the sample size goes down: Sample 250, 248 distinct values, 246 unique values: 250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490 Even in a case with an ovewhelming majority of unique values, the formula gets it wrong: 999 unque values in sample 998 appearing only once: 1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093 This means that, with a sample size of 1000 a table of 26million rows cannot ever have with this formula more than half a million distinct values, unless the column is a unique column. Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one. This is contrary to the notes in analyze.c, which seem to think that we're *overestimating* n_distinct. This formula appears broken everywhere: Table: 969000 rows Distinct values: 374000 Sample Size: 1000 Unique values in sample: 938 Values appearing only once: 918 1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308 Again, too small by a factor of 20x. This is so broken, in fact, that I'm wondering if we've read the paper right? I've perused the paper on almaden, and the DUJ1 formula appears considerably more complex than the formula we're using. Can someone whose math is more recent than calculus in 1989 take a look at that paper, and look at the formula toward the bottom of page 10, and see if we are correctly interpreting it? I'm particularly confused as to what "q" and "d-sub-n" represent. Thanks! -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster