> -----Original Message-----
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 24, 2005 2:08 PM
> To: Andrew Dunstan
> Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
> [...]
> Actually, that paper looks *really* promising.   Does anyone here have
> enough math to solve for D(sub)Md on page 6?   I'd like to test it on
> samples of < 0.01%.    
> [...]

D_Md = [1 - sqrt(f_1 / s)] D_b + sqrt(f_1 / s) D_B

s = block size

f~_1 = median frequency within blocks for distinct values occurring in
  only one block

D_b = d + f_1^(b+1)

d = distinct classes in the sample

f_1^(b+1) = number of distinct values occurring in a single block in
  a sample of b+1 blocks

D_B = d + [B / (b + 1)] f_1^(b+1)

b = sample size (in blocks)

B = total table size (in blocks)

f_k and f~_k are the only tricky functions here, but they are easy to 

Suppose our column contains values from the set {a, b, c, ..., z}.
Suppose we have a sample of b = 10 blocks.
Suppose that the value 'c' occurs in exactly 3 blocks (we don't care
how often it occurs *within* those blocks).
Suppose that the value 'f' also occurs in exactly 3 blocks.
Suppose that the values 'h', 'p', and 'r' occur in exactly 3 blocks.
Suppose that no other value occurs in exactly 3 blocks.

f_3^b = 5

This is because there are 5 distinct values that occur in exactly
3 blocks.  f_1^b is the number of distinct values that occur in
exactly 1 block, regardless of how often it occurs within that block.

Note that when you select a sample size of b blocks, you actually
need to sample b+1 blocks to compute f_1^(b+1).  This is actually
pedantry since all occurrences of b in the formula are really b+1.

f~ is slightly trickier.  First, we pick the distinct values that
occur in only one block.  Then, we count how often each value
occurs within its block.  To wit:

Suppose we have a set {d, q, y, z} of values that occur in only
one block.
Suppose that d occurs 3x, q occurs 1x, y occurs 8x, and z occurs 6x.

The function f- would take the mean of these counts to determine
the "cluster frequency".  So f- here would be 4.5.  This allows
one to compute D_MF.

The function f~ takes the median of this sample, which is 3 or 6
(or I suppose you could even take the mean of the two medians if
you wanted).

No tricky math involved.  That should be enough to tell you how to
write the estimator.

David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Reply via email to