On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <[EMAIL PROTECTED]> wrote: >Manfred Koizar <[EMAIL PROTECTED]> writes: >> What I have in mind is a kind of "Double Vitter" algorithm. [...] >> random sample of sample_size block numbers, and then to sample the rows >> out of this pool of blocks. > >That assumption is faulty, though --- consider wholly-empty pages. > >A bigger problem is that this makes the sampling quite nonuniform, >because rows that are on relatively low-density pages would be more >likely to become part of the final sample than rows that are on pages >with lots of tuples.
This sounds like you are assuming that I want to take exactly one tuple out of each block of the block sample. This is not the case. In the second round I plan to apply the same (or a better) Vitter method as it is done now. The main difference is that blocks will be adressed indirectly through the array of block numbers obtained in the first round. > Thus for example your sample would tend to favor >rows with wide values of variable-width columns and exclude narrower >values. (I am not certain that the existing algorithm completely avoids >this trap, but at least it tries.) I'm reading 7.4 source code and I fail to see how it does this. If the relation starts with an atypical distribution of wide/narrow or dead/alive tuples, a wrong value for tuplesperpage is used for the rest of the sampling. Tuples immediately following one or more dead tuples have a better chance of being selected. This may be called as random as anything else and not favouring a special property. OTOH after long runs of dead tuples consecutive tuples are likely to be selected. Your comment about nonuniformity above exactly describes the current algorithm: Once the initial sample is fetched and tuplesperpage is determined, targpos is computed without any further feedback. If targpos points to a sparsely populated area (with wide tuples or with many dead tuples) tuples in this area are more likely to get into the sample than tuples in densely populated areas (with many small active tuples). I think that cutting down the number of blocks to be looked at does not affect these problems. Servus Manfred ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly