Re: [PERFORM] query slows down with more accurate stats

Manfred Koizar Fri, 16 Apr 2004 15:45:31 -0700

On Fri, 16 Apr 2004 10:34:49 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
>>      p = \prod_{i=0}^{n-1} \frac{c(B - i)}{cB - i}
>
>So?  You haven't proven that either sampling method fails to do the
>same.


On the contrary, I believe that the above formula is more or less valid
for both methods.  The point is in what I said next:
| This probability grows with increasing B.

For the one-stage sampling method B is the number of pages of the whole
table.  With two-stage sampling we have to use n instead of B and get a
smaller probability (for n < B, of course).  So this merely shows that
the two sampling methods are not equivalent.
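
To put numbers on this, here is a minimal sketch that evaluates the
quoted product exactly.  It assumes p is read as the probability that all
n sampled tuples lie on different pages of a table with B pages of c
tuples each; that reading, the function name and the values of c and n
are illustrative assumptions, not something taken from ANALYZE itself.

```python
from fractions import Fraction


def distinct_pages_probability(c, pages, n):
    """Evaluate p = prod_{i=0}^{n-1} c*(pages - i) / (c*pages - i) exactly.

    c     -- tuples per page (assumed constant, hypothetical value)
    pages -- number of pages the tuples are drawn from
    n     -- sample size, expected to satisfy n <= pages
    """
    p = Fraction(1)
    for i in range(n):
        p *= Fraction(c * (pages - i), c * pages - i)
    return p


c, n = 10, 30
# One-stage sampling draws from all B pages of the table.
for B in (100, 1000, 10000):
    print("one-stage, B =", B, float(distinct_pages_probability(c, B, n)))
# Two-stage sampling: n takes the place of B in the formula.
print("two-stage, n =", n, float(distinct_pages_probability(c, n, n)))
```

For any fixed c > 1 and n > 1 the one-stage values rise towards 1 as B
grows, while the two-stage value stays at the bottom of that range.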

>The desired property can also be phrased as "every tuple should be
>equally likely to be included in the final sample".

Only at first sight.  You really expect more from random sampling.
Otherwise I'd just put one random tuple and its n - 1 successors (modulo
N) into the sample.  This satisfies your condition but you wouldn't call
it a random sample.

Random sampling is more like "every possible sample is equally likely to
be collected", and two-stage sampling doesn't satisfy this condition.
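
The counterexample above can be checked by brute force.  The following is
only a sketch with small, arbitrarily chosen N and n (tuples are numbered
0 .. N-1, and the helper name is made up for illustration): it enumerates
every sample the "random tuple plus its n - 1 successors" scheme can
produce, and compares per-tuple inclusion probability with the number of
distinct samples.

```python
from fractions import Fraction
from math import comb


def successor_samples(N, n):
    """All samples of the scheme: a start tuple and its n - 1 successors mod N."""
    return [frozenset((start + k) % N for k in range(n)) for start in range(N)]


N, n = 10, 3
samples = successor_samples(N, n)  # each start tuple is equally likely

# Probability that a given tuple ends up in the sample.
inclusion = [Fraction(sum(t in s for s in samples), len(samples)) for t in range(N)]

# Number of different samples the scheme can produce at all.
distinct = len(set(samples))

print("inclusion probabilities:", inclusion)
print("reachable samples:", distinct, "of", comb(N, n), "possible")
```

Every tuple has the same inclusion probability n/N, yet only N of the
C(N, n) possible samples can ever be collected.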

But if in your opinion the difference is not significant, I'll stop
complaining about my own idea.  Is there anybody else who cares?

>You could argue that a tuple on a heavily populated page is
>statistically likely to see a higher T when it's part of the page sample
>pool than a tuple on a near-empty page is likely to see, and therefore
>there is some bias against selection of the former tuple.  But given a
>sample over a reasonably large number of pages, the contribution of any
>one page to T should be fairly small and so this effect ought to be
>small.

It is even better:  Storing a certain number of tuples on heavily
populated pages takes fewer pages than storing them on sparsely
populated pages (due to tuple size or to dead tuples).  So heavily
populated pages are less likely to be selected in stage one, and this
exactly offsets the effect of increasing T.

>So I think this method is effectively unbiased at the tuple level.

Servus
 Manfred
