Re: [HACKERS] Gsoc2012 idea, tablesample

Greg Stark Tue, 17 Apr 2012 10:08:01 -0700

On Tue, Apr 17, 2012 at 5:33 PM, Christopher Browne <[email protected]> wrote:
> I get the feeling that this is a somewhat-magical feature (in that
> users haven't much hope of understanding in what ways the results are
> deterministic) that is sufficiently "magical" that anyone serious
> about their result sets is likely to be unhappy to use either SYSTEM
> or BERNOULLI.


These both sound pretty useful. "BERNOULLI" is fine for cases where
you aren't worried about time dependency in your data, for example if
you're looking for the average or total value of some column.

SYSTEM just means "I'm willing to trade some unspecified amount of
accuracy for some unspecified amount of speed" which presumably is
only good if you trust the database designers to make a reasonable
trade-off for cases where speed matters and the accuracy requirements
aren't very strict.
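
To make the difference concrete, here is a rough Python sketch of the two
methods. It assumes the usual reading of the terms (BERNOULLI keeps each row
independently, SYSTEM keeps or skips whole pages) and is not taken from any
actual implementation. The table layout and the names bernoulli_sample,
system_sample and make_table are made up for illustration.

```python
import random


def make_table(n_rows, rows_per_page):
    """Hypothetical table: rows stored in insertion order, split into pages."""
    rows = list(range(n_rows))
    return [rows[i:i + rows_per_page] for i in range(0, n_rows, rows_per_page)]


def bernoulli_sample(pages, fraction, rng):
    """Row-level sampling: every row is visited and kept independently
    with probability `fraction`."""
    return [row for page in pages for row in page if rng.random() < fraction]


def system_sample(pages, fraction, rng):
    """Block-level sampling: each page is kept or skipped as a whole, so a
    skipped page never has to be read, but rows in the sample are clustered."""
    sample = []
    for page in pages:
        if rng.random() < fraction:
            sample.extend(page)
    return sample


if __name__ == "__main__":
    rng = random.Random(2012)
    pages = make_table(100_000, 100)
    true_mean = sum(sum(p) for p in pages) / 100_000
    for name, sampler in (("BERNOULLI", bernoulli_sample),
                          ("SYSTEM", system_sample)):
        s = sampler(pages, 0.01, rng)
        estimate = sum(s) / len(s) if s else float("nan")
        print(name, "rows:", len(s), "mean estimate:", estimate,
              "true mean:", true_mean)
```

Because neighbouring rows share a page in this made-up layout, the SYSTEM
estimate rests on far fewer independent draws than the BERNOULLI one for the
same sampling fraction, which is the accuracy given up in exchange for
reading fewer pages.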

> Possibly the forms of sampling that people *actually* need, most of
> the time, are more like Dollar Unit Sampling, which are pretty
> deterministic, in ways that mandate that they be rather expensive
> (e.g. - guaranteeing Seq Scan).

I don't know about that but the cases I would expect to need other
distributions would be ones where you're looking at the tuples in a
non-linear way. Things like "what's the average gap between events" or
"what's the average number of instances per value".  These might
require a full table scan but might still be useful if the data is
going to be subsequently aggregated or joined in ways that would be
too expensive on the full data set.
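
As a rough illustration of why "average gap between events" doesn't fit
independent row sampling, the Python sketch below computes the mean gap on a
full set of event timestamps and on a row-level sample of it. The data
(evenly spaced events) and the function names are assumptions for the
example, not anything from a real implementation.

```python
import random


def mean_gap(timestamps):
    """Mean difference between consecutive timestamps.

    The consecutive gaps telescope, so their mean is (last - first) / (n - 1).
    Returns None when there are fewer than two events.
    """
    ts = sorted(timestamps)
    if len(ts) < 2:
        return None
    return (ts[-1] - ts[0]) / (len(ts) - 1)


def sampled_mean_gap(timestamps, fraction, rng):
    """Mean gap computed on a row-level sample: each event is kept
    independently with probability `fraction`."""
    kept = [t for t in timestamps if rng.random() < fraction]
    return mean_gap(kept)


if __name__ == "__main__":
    events = list(range(10_000))  # hypothetical: one event per time unit
    rng = random.Random(2012)
    print("true mean gap:", mean_gap(events))
    print("mean gap on a 10% sample:", sampled_mean_gap(events, 0.1, rng))
```

Dropping rows independently removes the neighbours each gap is measured
against, so the sampled gap comes out larger by roughly a factor of
1/fraction. An aggregate like this needs either the full scan or a sampling
method that keeps consecutive events together.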

But we shouldn't let the best be the enemy of the good here. Having
SYSTEM and BERNOULLI would solve most use cases and having those would
make it easier to add more later.

-- 
greg

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
