Re: [HACKERS] TABLESAMPLE patch

Petr Jelinek Fri, 10 Apr 2015 12:59:38 -0700

On 10/04/15 21:26, Peter Eisentraut wrote:

On 4/9/15 8:58 PM, Petr Jelinek wrote:

Well, you can have two approaches to this, either allow some specific
set of keywords that can be used to specify limit, or you let sampling
methods interpret parameters, I believe the latter is more flexible.
There is nothing stopping somebody writing sampling method which takes
limit as number of rows, or anything else.


Also for example for BERNOULLI to work correctly you'd need to convert
the number of rows to fraction of table anyway (and that's exactly what
the one database which has this feature does internally) and then it's
no different than passing (SELECT 100/reltuples*number_of_rows FROM
tablename) as a parameter.


What is your intended use case for this feature?  I know that "give me
100 random rows from this table quickly" is a common use case, but
that's kind of cumbersome if you need to apply formulas like that.  I'm
not sure what the use of a percentage is.  Presumably, the main use of
this feature is on large tables.  But then you might not even know how
large it really is, and even saying 0.1% might be more than you wanted
to handle.

My main intended use case is analytics on very big tables. The
percentages of population vs. confidence levels are pretty well mapped
there, and you can get quite big speedups if you are fine with getting
results with slightly lower confidence (i.e. you care about ballpark
figures).

But this was not really my point. BERNOULLI just does not work well
with a row limit by definition: it applies a probability to each
individual row, and while you can get the probability from a percentage
very easily (just divide by 100), to get it for a specific target number
of rows you have to know the total number of source rows, and that's not
something we can do very accurately, so you won't get 500 rows but
approximately 500 rows.
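To make the BERNOULLI point concrete, here is a rough Python sketch. It
is an illustration only, not code from the patch: the function names and
the 200,000-row stand-in table are made up for the example, and
estimated_total plays the role of a row-count estimate such as reltuples.

```python
import random


def percent_for_target(target_rows, estimated_total):
    """Convert a row target into a sampling percentage.

    estimated_total is only an estimate of the table size (hypothetical
    stand-in for something like reltuples), so the result is approximate.
    """
    return min(100.0, 100.0 * target_rows / estimated_total)


def bernoulli_sample(rows, percent, rng):
    """Keep each row independently with probability percent / 100."""
    probability = percent / 100.0
    return [row for row in rows if rng.random() < probability]


rng = random.Random(42)          # fixed seed so the run is repeatable
table = list(range(200_000))     # stand-in for a big table

percent = percent_for_target(500, estimated_total=len(table))
sample = bernoulli_sample(table, percent, rng)

# The size is close to 500 but not guaranteed to be exactly 500.
print(percent, len(sample))
```

Even with an exact row count the sample size only hovers around the
target, and if the estimate of the total is off, the expected size
shifts by the same factor.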

In any case, for "give me 500 somewhat random rows from the table
quickly" you probably want SYSTEM sampling anyway, as it will be orders
of magnitude faster on big tables, and yes, even 0.1% might be more than
you wanted in that case. I am not against having row-limit input for
methods that can work with it, like SYSTEM, but that's easily doable by
adding a separate sampling method which accepts rows (even if the
sampling algorithm itself is the same). In the current approach all
you'd have to do is write a different init function for the sampling
method and register it under a new name (yes, it won't be named SYSTEM
but, for example, SYSTEM_ROWLIMIT).
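As a rough idea of what a row-limited, block-level method could look
like, here is a Python sketch. It is only a model of the idea, not the
patch's SYSTEM implementation or its API: the function name, the
list-of-pages table layout and the 100-rows-per-page figure are all
assumptions made for the example.

```python
import random


def system_rowlimit_sample(pages, row_limit, rng):
    """Read whole pages in random order until row_limit rows are collected.

    pages is a list of pages, each page being a list of rows. Rows come
    back clustered by page, which is why this is only "somewhat random".
    """
    page_order = list(range(len(pages)))
    rng.shuffle(page_order)

    picked = []
    for page_no in page_order:
        for row in pages[page_no]:
            if len(picked) >= row_limit:
                return picked
            picked.append(row)
    return picked  # table had fewer rows than row_limit


rng = random.Random(7)
# Hypothetical table: 1000 pages of 100 rows, each row tagged (page, offset).
pages = [[(p, i) for i in range(100)] for p in range(1000)]

picked = system_rowlimit_sample(pages, 500, rng)
pages_touched = {page_no for page_no, _ in picked}
print(len(picked), len(pages_touched))
```

Here the row count is exact and only a handful of pages are touched,
which is where the speed comes from; the price is that the rows are
clustered rather than independently chosen.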


--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

