On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> Yes, it is not a perfect statistical sample. All sampling is subject
> to an error that is data dependent.

Well, there's random variation due to the limitations of dealing with a
sample, and then there's systemic bias due to incorrect algorithms. You
wouldn't be happy if the sampler discarded every row with NULLs, or every
row older than some date, etc. Those problems would not be corrected by
larger samples. That's the kind of "error" we're talking about here.

But the more I think about it, the less convinced I am that reading the
entire block introduces a systemic bias. I had assumed larger rows would
be selected against, but that's not really true: they're only selected
against in proportion to the number of bytes they occupy, which is the
correct frequency at which to sample them.

Even blocks that are mostly empty don't really bias things. Picture a
table that consists of 100 blocks with 100 rows each (value A) and
another 100 blocks with only 1 row each (value B). Any given sampled
block has a 50% chance of holding value B, which is grossly inflated;
however, each selected block with value A produces 100 rows. So if you
sample 10 blocks you'll get, on average, 5 blocks x 100 rows of A and
5 blocks x 1 row of B, i.e. 500 A to 5 B, which is the correct 100:1
proportion. I'm not actually sure there is any systemic bias here. The
larger number of rows per block makes the results less precise, but
from my thought experiments they still seem to be accurate.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
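The thought experiment above (100 full blocks of value A, 100 near-empty blocks of value B, sample whole blocks and keep every row) is easy to check with a quick Monte Carlo sketch. Everything here is hypothetical scaffolding, not PostgreSQL code; it just models "pick N blocks uniformly at random and take all their rows" and compares the sampled A:B ratio to the table's true 100:1 ratio.

```python
import random

# Hypothetical model of the thought experiment:
# 100 blocks holding 100 rows of value 'A' each, plus
# 100 blocks holding a single row of value 'B' each.
# True table contents: 10000 A rows vs 100 B rows, a 100:1 ratio.
blocks = [['A'] * 100 for _ in range(100)] + [['B'] for _ in range(100)]

def block_sample(blocks, n_blocks, rng):
    """Pick n_blocks whole blocks uniformly at random; return all their rows."""
    sample = []
    for block in rng.sample(blocks, n_blocks):
        sample.extend(block)
    return sample

rng = random.Random(42)
total_a = total_b = 0
for _ in range(2000):          # average over many 10-block samples
    s = block_sample(blocks, 10, rng)
    total_a += s.count('A')
    total_b += s.count('B')

ratio = total_a / total_b
print(ratio)                    # should hover near the true ratio of 100
```

Any single 10-block sample is noisy (it might draw 3 B-blocks or 8), which matches the point that whole-block sampling loses precision; but averaged over many samples the A:B ratio converges on 100:1, i.e. no systemic bias appears in this model.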