On 2015-08-06 22:17, Josh Berkus wrote:
On 08/06/2015 01:14 PM, Josh Berkus wrote:
On 08/06/2015 01:10 PM, Simon Riggs wrote:
Given, user-stated probability of accessing a block of P and N total
blocks, there are a few ways to implement block sampling.

1. Test P for each block individually. This gives a range of possible
results, with 0 blocks being possible outcome, though decreasing in
probability as P increases for fixed N. This is the same way BERNOULLI
works, we just do it for blocks rather than rows.

2. We calculate P/N at start of scan and deliver this number blocks by
random selection from N available blocks.

At present we do (1), exactly as documented. (2) is slightly harder
since we'd need to track which blocks have been selected already so we
can use a random selection with no replacement algorithm. On a table
with uneven distribution of rows this would still return a variable
sample size, so it didn't seem worth changing.

Aha, thanks!

So, seems like this is just a doc issue? That is, we just need to
document that using SYSTEM on very small sample sizes may return
unexpected numbers of results ... and maybe also how the algorithm
actually works.


Yes, it's expected behavior on very small sample size so doc patch seems best fix.

Following up on this ... where is TABLESAMPLE documented other than in
the SELECT command?  Doc search on the website is having issues right
now.  I'm happy to write a doc patch.


The user documentation is only in SELECT page, the rest is API docs.

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to