On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <st...@mit.edu> wrote:
> I'm not actually sure there is any systemic bias here. The larger
> number of rows per block generate less precise results but from my
> thought experiments they seem to still be accurate?

So I've done some empirical tests on a table generated by:

    create table sizeskew as
      (select i, j, repeat('i', i)
         from generate_series(1, 1000) as i,
              generate_series(1, 1000) as j);

I find that using the whole block doesn't cause any problem with the avg_width field for the "repeat" column. That does reinforce my belief that we might not need any particularly black magic here.

It does, however, cause a systematic error in the histogram bounds. The median seems to be overestimated by more and more as the number of rows used per block grows:

    rows per block   median
     1               524
     4               549
     8               571
    12               596
    16               602
    20               618   (total sample slightly smaller than normal)
    30               703   (substantially smaller sample)

So there is something clearly wonky in the histogram stats that's affected by the distribution of the sample. The only thing I can think of is that the most common elements may be selected preferentially from the early part of the sample, which removes a substantial part of the lower end of the range. But even removing 100 from the beginning shouldn't be enough to push the median above 550.

-- 
greg
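
A quick sanity check on that last estimate, as a sketch rather than anything ANALYZE actually does: in sizeskew every length from 1 to 1000 occurs equally often, so the true median length is the median of 1..1000. Assuming the worst case for the median, where the 100 values taken out of the histogram are exactly the 100 shortest lengths, the median of what remains is the median of 101..1000. The function name and the "drop the lowest n" model are my own assumptions for illustration.

```python
from statistics import median


def median_after_dropping_lowest(n_values: int, n_dropped: int) -> float:
    """Median of the lengths n_dropped+1 .. n_values.

    Assumes every length occurs equally often (as in sizeskew), so the
    median over rows equals the median over the distinct lengths.
    """
    remaining = range(n_dropped + 1, n_values + 1)
    return median(remaining)


if __name__ == "__main__":
    # True median of the full table.
    print(median_after_dropping_lowest(1000, 0))
    # Hypothetical worst case: the 100 shortest lengths are removed.
    print(median_after_dropping_lowest(1000, 100))
```

Under those assumptions the median moves from 500.5 to 550.5, which matches the estimate of roughly 550 and stays well short of the 618 and 703 measured at 20 and 30 rows per block.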