On 6 August 2015 at 21:14, Josh Berkus <j...@agliodbs.com> wrote: > On 08/06/2015 01:10 PM, Simon Riggs wrote: > > Given, user-stated probability of accessing a block of P and N total > > blocks, there are a few ways to implement block sampling. > > > > 1. Test P for each block individually. This gives a range of possible > > results, with 0 blocks being possible outcome, though decreasing in > > probability as P increases for fixed N. This is the same way BERNOULLI > > works, we just do it for blocks rather than rows. > > > > 2. We calculate P/N at start of scan and deliver this number blocks by > > random selection from N available blocks. >
(My mistake, that would be P*N) > > At present we do (1), exactly as documented. (2) is slightly harder > > since we'd need to track which blocks have been selected already so we > > can use a random selection with no replacement algorithm. On a table > > with uneven distribution of rows this would still return a variable > > sample size, so it didn't seem worth changing. > > Aha, thanks! > > So, seems like this is just a doc issue? That is, we just need to > document that using SYSTEM on very small sample sizes may return > unexpected numbers of results ... and maybe also how the algorithm > actually works. > For me, the docs seem exactly correct. The mathematical implications of that just aren't recorded explicitly. I will try to reword or add something to make it clear that this can return a variable number of blocks and thus produces a result with greater variability in the number of rows returned. It's documented on the SELECT page only; plus there is a whole new section on writing tablesample functions. -- Simon Riggs http://www.2ndQuadrant.com/ <http://www.2ndquadrant.com/> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services