On Fri, 02 Apr 2004 18:06:12 -0500, Tom Lane <[EMAIL PROTECTED]> wrote:
>You should not need to use the Vitter algorithm for the block-level
>selection, since you can know the number of blocks in the table in
>advance.  You can just use the traditional method of choosing each block
>or not with probability (k/K), where k = number of sample blocks still
>needed, K = number of blocks from here to the end.

Sounds reasonable.  I have to play around a bit more to get a feeling
for where the Vitter method becomes more efficient.
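To make sure we mean the same thing, here is a toy sketch of that
block-selection loop in plain C (standalone code, not a patch; the
function name sample_blocks and the use of rand() are mine, purely
for illustration):

#include <stdio.h>
#include <stdlib.h>

/*
 * Visit blocks in physical order; accept each one with probability
 * k/K, where k = sample blocks still needed and K = blocks from the
 * current position to the end (Knuth's selection sampling).
 * Seeding of rand() is omitted for brevity.
 */
static void
sample_blocks(int total_blocks, int sample_size)
{
    int k = sample_size;            /* blocks still needed */
    int b;

    for (b = 0; b < total_blocks && k > 0; b++)
    {
        int K = total_blocks - b;   /* blocks from here to the end */

        /* accept block b with probability k/K */
        if ((double) rand() / ((double) RAND_MAX + 1.0) * K < k)
        {
            printf("read block %d\n", b);
            k--;
        }
    }
}

int
main(void)
{
    sample_blocks(1000, 30);        /* e.g. 30 out of 1000 blocks */
    return 0;
}

Since the acceptance probability reaches 1 as soon as k == K, a single
sequential pass always yields exactly the requested number of blocks,
and the blocks come out in physical order.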
>You'd run the Vitter
>algorithm separately to decide whether to keep or discard each live row
>you find in the blocks you read.

You mean once a block is sampled we inspect it in any case?  This was
not the way I had planned to do it, but I'll keep this idea in mind.

>Question: if the table size is less than N blocks, are you going to read
>every block or try to reduce the number of blocks sampled?

Don't know yet.

>people are setting the stats target to 100 which means a sample size of
>30000 --- how do the page-access counts look in that case?

 rel  | page
 size | reads
------+-------
  300 |   300
 3000 |  3000
 5000 |  4999
  10K |  9.9K
  30K | 25.8K
 300K |   85K
   1M |  120K
  10M |  190K
 100M |  260K
   1G |  330K

This is exactly the table I posted before (for sample size 3000) with
every entry multiplied by 10.  Well, not quite exactly, but the
differences don't show up until far behind the decimal point.  So for
our purposes, for a given relation size the number of pages accessed
is proportional to the sample size.

Servus
 Manfred
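PS: For completeness, the row-level step in its simplest form, i.e.
Vitter's Algorithm R (his Algorithm Z draws the same sample while
skipping most of the random number generation, which is what the
efficiency question above is about).  Again only a toy sketch; the
names consider_row and SAMPLE_SIZE are made up for illustration:

#include <stdio.h>
#include <stdlib.h>

#define SAMPLE_SIZE 3000

/*
 * Reservoir sampling (Vitter's Algorithm R): keep the first
 * SAMPLE_SIZE rows; after that, row number t replaces a randomly
 * chosen reservoir slot with probability SAMPLE_SIZE/t.
 */
static int reservoir[SAMPLE_SIZE];  /* here just the row numbers */
static int rows_seen = 0;

static void
consider_row(int row)
{
    int slot;

    rows_seen++;
    if (rows_seen <= SAMPLE_SIZE)
    {
        reservoir[rows_seen - 1] = row;     /* fill phase */
        return;
    }

    /* slot is uniform in [0, rows_seen); keep row iff slot < SAMPLE_SIZE */
    slot = (int) ((double) rand() / ((double) RAND_MAX + 1.0) * rows_seen);
    if (slot < SAMPLE_SIZE)
        reservoir[slot] = row;              /* replace phase */
}

int
main(void)
{
    int row;

    /* pretend we found 100000 live rows in the sampled blocks */
    for (row = 1; row <= 100000; row++)
        consider_row(row);

    printf("reservoir[0] = row %d (of %d rows seen)\n",
           reservoir[0], rows_seen);
    return 0;
}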