Manfred Koizar <[EMAIL PROTECTED]> writes: > If the number of pages is B and the sample size is n, a perfect sampling > method collects a sample where all tuples come from different pages with > probability (in OpenOffice.org syntax): > p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}
So? You haven't proven that either sampling method fails to do the same. The desired property can also be phrased as "every tuple should be equally likely to be included in the final sample". What we actually have in the case of your revised algorithm is "every page is equally likely to be sampled, and of the pages included in the sample, every tuple is equally likely to be chosen". Given that there are B total pages of which we sample b pages that happen to contain T tuples (in any distribution), the probability that a particular tuple gets chosen is (b/B) * (n/T) assuming that the two selection steps are independent and unbiased. Now b, B, and n are not dependent on which tuple we are talking about. You could argue that a tuple on a heavily populated page is statistically likely to see a higher T when it's part of the page sample pool than a tuple on a near-empty page is likely to see, and therefore there is some bias against selection of the former tuple. But given a sample over a reasonably large number of pages, the contribution of any one page to T should be fairly small and so this effect ought to be small. In fact, because T directly determines our estimate of the total number of tuples in the relation, your experiments showing that the new method gives a reliable tuple count estimate directly prove that T is pretty stable regardless of exactly which pages get included in the sample. So I think this method is effectively unbiased at the tuple level. The variation in probability of selection of individual tuples can be no worse than the variation in the overall tuple count estimate. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend