> Date: Wed, 18 Apr 2012 02:45:09 +0300
> Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
> From: a...@cybertec.at
> To: cbbro...@gmail.com
> CC: sfr...@snowman.net; pgsql-hackers@postgresql.org
>
> On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbro...@gmail.com>
> wrote:
> > Well, there may be cases where the quality of the sample isn't
> > terribly important, it just needs to be "reasonable."
> >
> > I browsed an article on the SYSTEM/BERNOULLI representations; they
> > both amount to simple picks of tuples.
> >
> > - BERNOULLI implies picking tuples with a specified probability.
> >
> > - SYSTEM implies picking pages with a specified probability. (I think
> > we mess with this in ways that'll be fairly biased in view that tuples
> > mayn't be of uniform size, particularly if Slightly Smaller strings
> > stay in the main pages, whilst Slightly Larger strings get TOASTed...)

Looking at the definition of the BERNOULLI method, it has to scan all the
tuples, which raises a question I have always had: what is the difference
between using the BERNOULLI method and running
"select * .... where random() < 0.1"? Both go through all the tuples and
cost a seq scan.

If the answer to the above question is "no difference", I have a proposal
for another way to implement BERNOULLI. For a relation, we assign every tuple
a unique, contiguous ID (we may use ctid or something else). Then, for each
number in the set of IDs, we draw a random number and check whether it is
smaller than the sampling percentage. If it is smaller, we retrieve the tuple
corresponding to that ID. This method does not seq scan all the tuples, but it
still samples by picking individual tuples.

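To make the comparison concrete, here is a rough Python sketch of the two approaches. It is only a toy model, not PostgreSQL code: the names bernoulli_scan, bernoulli_by_id and CountingTable are made up for illustration, and it assumes that tuple IDs are contiguous integers and that a tuple can be fetched directly by its ID.

```python
import random


def bernoulli_scan(rows, fraction, rng):
    """Read every tuple and keep each one with probability `fraction`
    (the "where random() < 0.1" style of sampling)."""
    return [row for row in rows if rng.random() < fraction]


class CountingTable:
    """Hypothetical table that counts how many tuples are actually fetched."""

    def __init__(self, rows):
        self.rows = rows
        self.fetches = 0

    def fetch(self, tuple_id):
        self.fetches += 1
        return self.rows[tuple_id]


def bernoulli_by_id(table, n_ids, fraction, rng):
    """Draw one random number per ID and fetch only the selected tuples.
    Assumes IDs 0..n_ids-1 are contiguous and map directly to tuples."""
    sample = []
    for tuple_id in range(n_ids):
        if rng.random() < fraction:
            sample.append(table.fetch(tuple_id))
    return sample


rows = list(range(1000))
table = CountingTable(rows)

# Same seed for both, so both make the same keep/skip decisions.
scan_sample = bernoulli_scan(rows, 0.1, random.Random(42))
id_sample = bernoulli_by_id(table, len(rows), 0.1, random.Random(42))
```

With the same seed both functions return the same sample, but the second one only fetches the tuples it keeps. It still draws one random number per ID, so the saving is in tuple access, not in the number of random draws.
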
Best Regards and Thanks,
Huang Qi Victor
Computer Science, National University of Singapore