On Sun, Dec 8, 2013 at 7:03 PM, Josh Berkus <j...@agliodbs.com> wrote:
> They will certainly do so if you don't apply any statistical adjustments
> for selecting more rows from the same pages.
>
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block.
I just think this is an oversimplification. There's no skew introduced
just by reading all the rows in each block unless there's some kind of
dependency between the block a row is placed on and the data in it. So I
don't believe there can be any single set of math that removes the skew
automatically; the math will depend on what the dependency is.

Just to be clear, you have to think pretty hard about the way the
Postgres internals work to see what kinds of skew might be appearing
here. Due to the way Postgres updates and HOT cleanups work, "hot"
tuples will be weighted less than "cold" tuples. That's not going to be
something an author of an ACM paper knew to design into their maths.

I do have access to ACM and other academic articles if you remember any
author names or keywords, but if it's a database journal I would worry
about patent issues. Do you remember if it was over 17 years old?

> Obviously, you need to look at more rows than you would with a
> pure-random sample. Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample.

I really don't believe the 5% figure. It's not enough for n_distinct and
it's *far* too high for linear properties like histograms or nullfrac.
From a performance point of view it's also too high to be worth
bothering with: if we have to read 5% of the table we might as well do a
full scan anyway; it'll be only marginally slower and give much better
quality results.

--
greg
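
To make the dependency point concrete, here's a rough toy simulation
(plain Python; the page size, value distribution, and sample size are
all made up, and none of this is taken from the actual ANALYZE code).
It draws the same number of rows two ways, once as a pure-random row
sample and once by reading *all* rows from randomly chosen blocks,
against two physical layouts of the same data:

import random

ROWS_PER_BLOCK = 100      # toy page capacity, made up
N_BLOCKS = 10_000         # 1M rows total
SAMPLE_ROWS = 30_000      # 3% sample, drawn both ways
DUPS_PER_VALUE = 50       # each value appears 50 times

def make_table(clustered):
    # 20,000 distinct values, 50 copies each; the clustered layout keeps
    # all copies of a value on the same page, the shuffled layout
    # spreads them across the whole table.
    values = [i // DUPS_PER_VALUE for i in range(ROWS_PER_BLOCK * N_BLOCKS)]
    if not clustered:
        random.shuffle(values)
    return [values[b * ROWS_PER_BLOCK:(b + 1) * ROWS_PER_BLOCK]
            for b in range(N_BLOCKS)]

def row_sample(table):
    # pure-random sample of individual rows
    flat = [v for block in table for v in block]
    return random.sample(flat, SAMPLE_ROWS)

def block_sample(table):
    # read *all* rows from randomly chosen blocks, same total row count
    chosen = random.sample(range(N_BLOCKS), SAMPLE_ROWS // ROWS_PER_BLOCK)
    return [v for b in chosen for v in table[b]]

for clustered in (True, False):
    table = make_table(clustered)
    print("clustered" if clustered else "shuffled ",
          "distinct seen: rows=%d blocks=%d"
          % (len(set(row_sample(table))), len(set(block_sample(table)))))

With the clustered layout the block sample sees far fewer distinct
values than the row sample; with the shuffled layout, where there's no
dependency between page and value, the two come out about the same. The
particular numbers don't matter, the point is that the bias appears and
disappears with the physical layout, which is exactly the dependency
I'm talking about.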