On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli <[email protected]> wrote: > On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote: >> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote: > >> > SELECT (SELECT reservoir_sample(some_table, 50) AS samples >> > FROM some_table WHERE ctid =~ ANY (rnd_pgtids)) >> > FROM random_pages('some_table', 50) AS rnd_pgtids; >> >> But I don't understand the reservoir_sample call, what is it supposed to do ? > > Ok got it, that was probably to avoid: > > ERROR: more than one row returned by a subquery used as an expression
No, it's to avoid bias towards tuples on more sparsely populated pages. See http://en.wikipedia.org/wiki/Reservoir_sampling or http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/analyze.c;h=ff271644e0f93ee99bfe9c1f536f3dd48455d8d2;hb=HEAD#l1027 > The advanced TID operator would be for random_tids to only return pages rather > than full tids... Exactly. But when mainly IO bound (ie. sampling from a large table on spinning rust) the overhead of probing with TID scan as opposed to sequentially scanning the pages should be small enough. When CPU bound I suspect that the function call machinery overhead for reservoir_sample is going to become a large issue, so a built in tablesample also has an edge there. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
