Re: [HACKERS] Gsoc2012 idea, tablesample

Qi Huang Tue, 17 Apr 2012 20:12:41 -0700


> Date: Wed, 18 Apr 2012 02:45:09 +0300
> Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
> From: a...@cybertec.at
> To: cbbro...@gmail.com
> CC: sfr...@snowman.net; pgsql-hackers@postgresql.org
> 
> On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbro...@gmail.com> 
> wrote:
> > Well, there may be cases where the quality of the sample isn't
> > terribly important, it just needs to be "reasonable."
> >
> > I browsed an article on the SYSTEM/BERNOULLI representations; they
> > both amount to simple picks of tuples.
> >
> > - BERNOULLI implies picking tuples with a specified probability.
> >
> > - SYSTEM implies picking pages with a specified probability.  (I think
> > we mess with this in ways that'll be fairly biased in view that tuples
> > mayn't be of uniform size, particularly if Slightly Smaller strings
> > stay in the main pages, whilst Slightly Larger strings get TOASTed...) 
Looking at the definition of BERNOULLI method and it means to scan all the 
tuples, I always have a question. What is the difference of using BERNOULLI 
method with using "select * .... where rand() < 0.1"? They will both go through 
all the tuples and cost a seq-scan. If the answer to the above question is "no 
difference", I have one proposal for another method of BERNOULLI. For a 
relation, we can have all their tuples assigned an unique and continuous ID( we 
may use ctid or others). Then for each number in the set of IDs, we assign a 
random number and check whether that is smaller than the sampling percentage. 
If it is smaller, we retrieve the tuple corresponding to that ID. This method 
will not seq scan all the tuples, but it can sample by picking tuples.Thanks 

Best Regards and ThanksHuang Qi VictorComputer Science of National University 
of Singapore
Re: [HACKERS] Gsoc2012 idea, tablesample

Reply via email to