Re: [HACKERS] Gsoc2012 idea, tablesample

Christopher Browne Tue, 17 Apr 2012 09:33:47 -0700

On Tue, Apr 17, 2012 at 11:27 AM, Stephen Frost <[email protected]> wrote:
> Qi,
>
> * Qi Huang ([email protected]) wrote:
>> > Doing it 'right' certainly isn't going to be simply taking what Neil did
>> > and updating it, and I understand Tom's concerns about having this be
>> > more than a hack on seqscan, so I'm a bit nervous that this would turn
>> > into something bigger than a GSoC project.
>>
>> As Christopher Browne mentioned, for this sampling method, it is not 
>> possible without scanning the whole data set. It improves the sampling 
>> quality but increases the sampling cost. I think it should also be using 
>> only for some special sampling types, not for general. The general sampling 
>> methods, as in the SQL standard, should have only SYSTEM and BERNOULLI 
>> methods.
>
> I'm not sure what sampling method you're referring to here.  I agree
> that we need to be looking at implementing the specific sampling methods
> listed in the SQL standard.  How much information is provided in the
> standard about the requirements placed on these sampling methods?  Does
> the SQL standard only define SYSTEM and BERNOULLI?  What do the other
> databases support?  What does SQL say the requirements are for 'SYSTEM'?


Well, there may be cases where the quality of the sample isn't
terribly important; it just needs to be "reasonable."

I browsed an article on the SYSTEM/BERNOULLI representations; they
both amount to simple picks of tuples.

- BERNOULLI implies picking tuples with a specified probability.

- SYSTEM implies picking pages with a specified probability.  (I think
our storage will make this fairly biased, given that tuples may not be
of uniform size, particularly if slightly smaller strings stay in the
main pages, whilst slightly larger strings get TOASTed...)
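
To make the difference between the two concrete, here is a minimal
Python sketch (not PostgreSQL code). It models a table as a list of
pages, each page being a list of tuples. The function names and the toy
data are hypothetical, made up purely for illustration.

```python
import random

def bernoulli_sample(pages, p, rng):
    """Keep each tuple independently with probability p (visits every page)."""
    return [t for page in pages for t in page if rng.random() < p]

def system_sample(pages, p, rng):
    """Keep each page with probability p; a kept page contributes all its tuples."""
    out = []
    for page in pages:
        if rng.random() < p:
            out.extend(page)
    return out

# Hypothetical table: five pages holding 3, 8, 1, 8 and 5 tuples respectively,
# standing in for pages whose fill varies with tuple width.
pages = [list(range(i * 10, i * 10 + n)) for i, n in enumerate([3, 8, 1, 8, 5])]
rng = random.Random(42)

print(bernoulli_sample(pages, 0.4, rng))
print(system_sample(pages, 0.4, rng))
```

In this toy model SYSTEM returns tuples in whole-page clumps, so with
unevenly filled pages the sample size swings more than with BERNOULLI
at the same probability.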

I get the feeling that this is a somewhat "magical" feature, in that
users haven't much hope of understanding in what ways the results are
deterministic, and that anyone serious about their result sets is
likely to be unhappy to use either SYSTEM or BERNOULLI.

Possibly the forms of sampling that people *actually* need, most of
the time, are more like Dollar Unit Sampling, which is pretty
deterministic, in ways that mandate that it be rather expensive
(e.g. - guaranteeing a Seq Scan).
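
For illustration, a rough Python sketch of one common form of Dollar
Unit Sampling (fixed-interval selection over cumulative amounts). This
is my assumption about the variant meant; the function name and the
amounts are hypothetical. It assumes non-negative amounts with a
positive total and n >= 1. Note that it needs the grand total before it
can pick anything, which is why a full scan is unavoidable.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def dollar_unit_sample(amounts, n, rng):
    """Return indices of the items containing n evenly spaced dollar units."""
    total = sum(amounts)                 # requires a pass over every row
    interval = total / n                 # one pick per `interval` dollars
    start = rng.random() * interval      # random offset within the first interval
    cumulative = list(accumulate(amounts))
    picks = []
    for k in range(n):
        target = start + k * interval
        # First item whose running total exceeds the target dollar.
        idx = bisect_right(cumulative, target)
        picks.append(min(idx, len(amounts) - 1))  # guard against float rounding
    return picks

# Hypothetical ledger: one large item among several small ones.
amounts = [5, 1000, 5, 5, 5]
print(dollar_unit_sample(amounts, 4, random.Random(7)))
```

Any item at least as large as the interval is always selected, and may
be hit more than once; only the starting offset is random.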
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

-- 
Sent via pgsql-hackers mailing list ([email protected])
