Re: [HACKERS] Gsoc2012 idea, tablesample

Qi Huang Thu, 10 May 2012 01:44:01 -0700

Hi, All    Thanks for your ideas on the implementation of TABLESAMPLE. I have a 
summary below of the high level requirements from the -hacker thread till now. 
Please give further comment and if I missed any point, please fell free to add. 
1. Build a new type of node, as I should not use SEQSCAN node, which will be 
scanning the whole table. One of the aims of TABLESAMPLE is to avoid scanning 
every row and thus save time. 
2. use TIDSCAN to directly access tuples. The below way of using ctid proposed 
by Kevin looks good.
-One technique which might be suitably random without reading the-whole table 
would be to figure out a maximum block number and tuple-ID for the table, and 
generate a series of random ctid values to-read. If the tuple doesn't exist or 
is not visible to the snapshot,-you ignore it and continue, until you have read 
the requisite number-of rows. You could try to generate them in advance and 
sort them by-block number, but then you need to solve the problems of what to 
do-if that set of ctids yields too many rows or too few rows, both of-which 
have sticky issues.--Kevin   I think this technique could be considered as an 
implementation algo for BERNOULLI method. It looks that it could still reduce a 
lot of cost compared to just assign random number to every tuple and then 
retrieve. 
3. Adding specific sampling, like dollar unit sampling, or geographic index 
support, which meets specific requirement, but costs more. -I have a gut 
feeling that Dollar Unit Sampling and other weighted-samples can be done with a 
similar approach of uniformly sampling-blocks and then running weighted 
reservoir sampling [1] over the-result.-[1]: 
http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf-Cheers,--Ants 
Aasma
  Aasma proposed the above method for doing these problems, of using the 
reservoir weighted random sampling. 
4. Implement the way ANALYZE is doing sampling. -I'm looking for a way to fetch 
random samples these days so I confirm-the need for a quick way to fetch the 
same sample that "analyze"-command fetches but at SQL level.---strk;     
Proposed by Sandro Santilli, we can look at the way ANALYZE is doing the 
sampling and try to implement it in TABLESAMPLE.



The above are the four points I summarized from the -hackers discussion. I put 
a brief introduction the this gSoc project on Postgres Wiki. You are welcomed 
to check and modify to enhance it. 
http://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation

I will add the ideas formalized in this mailing thread to the wiki page once 
our discussion has been completed. 

Thanks.

Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Re: [HACKERS] Gsoc2012 idea, tablesample

Reply via email to