On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbro...@gmail.com> wrote: > Well, there may be cases where the quality of the sample isn't > terribly important, it just needs to be "reasonable." > > I browsed an article on the SYSTEM/BERNOULLI representations; they > both amount to simple picks of tuples. > > - BERNOULLI implies picking tuples with a specified probability. > > - SYSTEM implies picking pages with a specified probability. (I think > we mess with this in ways that'll be fairly biased in view that tuples > mayn't be of uniform size, particularly if Slightly Smaller strings > stay in the main pages, whilst Slightly Larger strings get TOASTed...)
Dealing with non uniform sizes isn't too hard. analyze.c already does that. Given a table with B blocks it takes a uniform sample of b blocks, and runs Vitter's reservoir sampling algorithm over the resulting blocks to get s random tuples in a single pass. It's quite easy to prove that this results in each tuple having an equal probability to appear in the final table. However, it isn't fully independent sampling - depending on the value of b compared to s and B, there is a slightly higher probability to see multiple tuples picked from one page. I'm too lazy to do the math, but for the analyze case of b = s it probably isn't significant for most practical purposes, even if B is really large. And it seems to me that when b >= s the reservoir sampling thresholds could be tweaked at block boundaries to even out the dependencies. The ratio of b to s could be tweaked to get lower quality sampling (samples are more spatially clumped) in exchange for less random I/O. > Possibly the forms of sampling that people *actually* need, most of > the time, are more like Dollar Unit Sampling, which are pretty > deterministic, in ways that mandate that they be rather expensive > (e.g. - guaranteeing Seq Scan). I have a gut feeling that Dollar Unit Sampling and other weighted samples can be done with a similar approach of uniformly sampling blocks and then running weighted reservoir sampling [1] over the result. [1]: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf Cheers, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers