On 12/11/2013 02:08 PM, Greg Stark wrote:
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <st...@mit.edu> wrote:
I'm not actually sure there is any systemic bias here. The larger
number of rows per block generates less precise results, but from my
thought experiments they still seem to be accurate?

So I've done some empirical tests for a table generated by:
create table sizeskew as (select i,j,repeat('i',i) from
generate_series(1,1000) as i, generate_series(1,1000) as j);

I find that using the whole block doesn't cause any problem with the
avg_width field for the "repeat" column. That reinforces my belief
that we might not need any particular black magic here.
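
For reference, I assume the check was something like the sketch below: after an ANALYZE, the width estimate produced by the sample shows up in pg_stats (I'm assuming the unaliased repeat('i',i) column got the default name "repeat"):

-- Sketch of the verification: analyze the test table, then look at the
-- per-column width estimates computed from the sample.
ANALYZE sizeskew;

SELECT attname, avg_width, null_frac
FROM pg_stats
WHERE tablename = 'sizeskew';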

How large a sample did you use? Remember that the point of doing block-level sampling instead of the current approach would be to allow using a significantly smaller sample (in # of blocks) while still achieving the same sampling error. If the sample is "large enough", it will mask any systemic bias caused by block sampling, but the whole point is to reduce the number of sampled blocks.
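
To put rough numbers on that: ANALYZE currently aims for a row sample of about 300 * statistics_target rows (so ~30000 rows at the default target of 100), and with the current row-level approach that can mean reading nearly that many blocks on a large table. Something like this shows the total block count and average rows per block for the test table; I haven't run it, so treat it as a sketch:

-- Rough sizing for the discussion: current statistics target, total number
-- of blocks in the test table, and the average number of rows per block
-- (relpages/reltuples are populated by the ANALYZE above).
SHOW default_statistics_target;

SELECT relpages, reltuples,
       reltuples / relpages AS avg_rows_per_block
FROM pg_class
WHERE relname = 'sizeskew';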

The practical question here is this: what happens to the quality of the statistics if you read only half the number of blocks you normally would, but include all the rows from the blocks you do read in the sample? How about 1/10?
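
How much that costs in quality presumably depends on how many rows each block holds, since that's how many extra (correlated) rows a whole-block read adds to the sample. A quick-and-dirty way to look at the per-block row counts, using the block number embedded in ctid (the text/point cast is a hack, but it works):

-- Rows per block in the test table; the first component of ctid is the
-- block number, extracted here by casting through text and point.
SELECT (ctid::text::point)[0]::int AS blkno, count(*) AS rows_in_block
FROM sizeskew
GROUP BY 1
ORDER BY 1
LIMIT 10;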

Or to put it another way: could we achieve more accurate statistics by including all rows from the sampled blocks, while reading the same number of blocks? In particular, I wonder if it would help with estimating ndistinct; a larger sample generally helps with ndistinct estimation, so it might be beneficial.
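
If someone wants to measure that: the estimate ends up in pg_stats.n_distinct (negative values mean a fraction of the row count rather than an absolute number), and for this test table it can be compared against the true value directly. A sketch, using the j column:

-- Compare the sampled ndistinct estimate against the true distinct count.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'sizeskew' AND attname = 'j';

SELECT count(DISTINCT j) AS true_ndistinct
FROM sizeskew;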

- Heikki

