On 12/11/2013 02:08 PM, Greg Stark wrote:
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <st...@mit.edu> wrote:
I'm not actually sure there is any systemic bias here. The larger
number of rows per block generates less precise results, but from my
thought experiments they still seem to be accurate?

So I've done some empirical tests for a table generated by:
create table sizeskew as (select i,j,repeat('i',i) from
generate_series(1,1000) as i, generate_series(1,1000) as j);

I find that using the whole block doesn't cause any problem with the
avg_width field for the "repeat" column. That reinforces my belief
that we might not need any particular black magic here.
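
For reference, I assume the check was something like the sketch below: after an ANALYZE, the width estimate produced by the sample shows up in pg_stats (I'm assuming the unaliased repeat('i',i) column got the default name "repeat"):

-- Sketch of the verification: analyze the test table, then look at the
-- per-column width estimates computed from the sample.
ANALYZE sizeskew;

SELECT attname, avg_width, null_frac
FROM pg_stats
WHERE tablename = 'sizeskew';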

How large a sample did you use? Remember that the point of doing block-level sampling instead of the current approach would be to allow using a significantly smaller sample (in # of blocks) while still achieving the same sampling error. If the sample is "large enough", it will mask any systemic bias caused by block sampling, but the whole point is to reduce the number of sampled blocks.
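
To put rough numbers on that: ANALYZE currently aims for a row sample of about 300 * statistics_target rows (so ~30000 rows at the default target of 100), and with the current row-level approach that can mean reading nearly that many blocks on a large table. Something like this shows the total block count and average rows per block for the test table; I haven't run it, so treat it as a sketch:

-- Rough sizing for the discussion: current statistics target, total number
-- of blocks in the test table, and the average number of rows per block
-- (relpages/reltuples are populated by the ANALYZE above).
SHOW default_statistics_target;

SELECT relpages, reltuples,
       reltuples / relpages AS avg_rows_per_block
FROM pg_class
WHERE relname = 'sizeskew';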

The practical question here is this: what happens to the quality of the statistics if you read only half the number of blocks you normally would, but include all the rows from the blocks you do read in the sample? How about 1/10?
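
How much that costs in quality presumably depends on how many rows each block holds, since that's how many extra (correlated) rows a whole-block read adds to the sample. A quick-and-dirty way to look at the per-block row counts, using the block number embedded in ctid (the text/point cast is a hack, but it works):

-- Rows per block in the test table; the first component of ctid is the
-- block number, extracted here by casting through text and point.
SELECT (ctid::text::point)[0]::int AS blkno, count(*) AS rows_in_block
FROM sizeskew
GROUP BY 1
ORDER BY 1
LIMIT 10;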

Or to put it another way: could we achieve more accurate statistics by including all rows from the sampled blocks, while reading the same number of blocks? In particular, I wonder if it would help with estimating ndistinct; a larger sample generally helps with ndistinct estimation, so it might be beneficial.
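
If someone wants to measure that: the estimate ends up in pg_stats.n_distinct (negative values mean a fraction of the row count rather than an absolute number), and for this test table it can be compared against the true value directly. A sketch, using the j column:

-- Compare the sampled ndistinct estimate against the true distinct count.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'sizeskew' AND attname = 'j';

SELECT count(DISTINCT j) AS true_ndistinct
FROM sizeskew;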

- Heikki

