Re: [HACKERS] Gsoc2012 idea, tablesample

Florian Pflug Thu, 10 May 2012 17:39:11 -0700

On May 10, 2012, at 18:36, Kevin Grittner wrote:
> Robert Haas <robertmh...@gmail.com> wrote:
> 
>> I wonder if you could do this with something akin to the Bitmap
>> Heap Scan machinery.  Populate a TID bitmap with a bunch of
>> randomly chosen TIDs, fetch them all in physical order
>> and if you don't get as many rows as you need, rinse and repeat
>> until you do.
> 
> Ay, there's the rub.  If you get too many, it is important that you
> read all the way to the end and then randomly omit some of them.


Why is that? From a statistical point of view it shouldn't matter
whether you pick N random samples, or pick M >= N random samples and
then randomly pick N from M. (random implying uniformly distributed
here).
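
To make that concrete, here is a minimal sketch (not from any patch; the
function name and the toy sizes are made up for illustration) that
enumerates a tiny "table" exhaustively and computes the exact probability
of every possible N-subset under the two-stage scheme. If the claim holds,
every subset should come out equally likely.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def two_stage_distribution(population, m, n):
    """Exact probability of each n-subset when we first draw a uniform
    m-subset of the population and then a uniform n-subset of that."""
    population = list(population)
    p_outer = Fraction(1, comb(len(population), m))  # each m-subset
    p_inner = Fraction(1, comb(m, n))                # each n-subset of it
    probs = {}
    for outer in combinations(population, m):
        for inner in combinations(outer, n):
            probs[inner] = probs.get(inner, Fraction(0)) + p_outer * p_inner
    return probs


# Toy example: 6 rows, over-sample M = 4, keep N = 2.
dist = two_stage_distribution(range(6), m=4, n=2)
uniform = Fraction(1, comb(6, 2))
print(all(p == uniform for p in dist.values()))
```

This only checks the case where both stages are uniform draws without
replacement, which is what I mean by "random" above.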

> While a bit of a bother, that's pretty straightforward and should be
> pretty fast, assuming you're not, like, an order of magnitude high. 
> But falling short is tougher; making up the difference could be an
> iterative process, which could always wind up with having you read
> all tuples in the table without filling your sample.

But the likelihood of that happening is extremely low, no? Unless the
sampling percentage is very high, that is, but that case isn't of much
practical importance anyway.
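
A rough way to get a feeling for this is a toy simulation of the
"rinse and repeat" idea. This is only a sketch under assumptions of my
own: TIDs are modelled as (block, offset) pairs, a fixed 60% of the slots
hold a live tuple, and sample_by_random_tids is a hypothetical helper,
not anything in the backend.

```python
import random


def sample_by_random_tids(live_tids, all_tids, wanted, rng, batch=None):
    """Probe random TIDs in batches until `wanted` live rows are found.

    Returns (rows, number of TIDs probed, number of rounds)."""
    live = set(live_tids)
    unprobed = list(all_tids)
    rng.shuffle(unprobed)          # random probe order, no TID probed twice
    batch = batch or wanted
    found, probed, rounds = [], 0, 0
    while len(found) < wanted and unprobed:
        rounds += 1
        chunk = sorted(unprobed[-batch:])   # fetch a batch in physical order
        del unprobed[-batch:]
        probed += len(chunk)
        found.extend(tid for tid in chunk if tid in live)
    rng.shuffle(found)             # on overshoot, drop the extras at random
    return found[:wanted], probed, rounds


rng = random.Random(2012)
all_tids = [(blk, off) for blk in range(1000) for off in range(1, 101)]
live_tids = rng.sample(all_tids, 60_000)    # assume 60% of slots are live
rows, probed, rounds = sample_by_random_tids(live_tids, all_tids, 1000, rng)
print(f"{len(rows)} rows after probing {probed} of {len(all_tids)} TIDs "
      f"in {rounds} round(s)")
```

With a sample this small relative to the table, the loop should finish
after touching only a small fraction of the TIDs; pushing `wanted` towards
the number of live tuples is what drives it to read everything.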

But something else comes to mind. Does the standard permit samples taken
with the BERNOULLI method to contain the same tuple multiple times? If
not, any kind of TID-based approach will have to remember all previously
fetched TIDs, which seems doable but unfortunate...
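
The bookkeeping I have in mind would look roughly like the sketch below.
It assumes TIDs are (block, offset) pairs with a fixed number of offsets
per block, and draw_new_tids is a made-up name; a real implementation
would more likely reuse the TID bitmap itself than a separate set.

```python
import random


def draw_new_tids(n_blocks, max_offset, count, seen, rng):
    """Draw `count` random TIDs that are not in `seen`, record them in
    `seen`, and return them in physical order."""
    total = n_blocks * max_offset
    if count > total - len(seen):
        raise ValueError("not enough unseen TIDs left")
    fresh = []
    while len(fresh) < count:
        tid = (rng.randrange(n_blocks), rng.randint(1, max_offset))
        if tid not in seen:        # reject anything fetched in earlier rounds
            seen.add(tid)
            fresh.append(tid)
    return sorted(fresh)


rng = random.Random(42)
seen = set()
first = draw_new_tids(4, 5, 10, seen, rng)    # first round
second = draw_new_tids(4, 5, 10, seen, rng)   # "rinse and repeat" round
print(set(first).isdisjoint(second), len(seen))
```

The cost is that `seen` grows with every round, which is the
"unfortunate" part.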

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
