On 12/08/2013 10:14 AM, Greg Stark wrote: > With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123 > With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125 > With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164 > With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213 > > I'm not really sure if this is due to the blocky sample combined with > the skewed pgbench run or not. It doesn't seem to be consistently > biasing towards or against bid 1 which I believe are the only rows > that would have been touched by pgbench. Still it's suspicious that > they seem to be consistently getting less accurate as the blockiness > increases.

They will certainly do so if you don't apply any statistical adjustments for selecting more rows from the same pages. So there's a set of math designed to calculate for the skew introduced by reading *all* of the rows in each block. That's what I meant by "block-based sampling"; you read, say, 400 pages, you compile statistics on *all* of the rows on those pages, you apply some algorithms to adjust for groupings of rows based on how grouped they are. And you have a pretty good estimate of how grouped they are, because you just looked a complete sets of rows on a bunch of nonadjacent pages. Obviously, you need to look at more rows than you would with a pure-random sample. Like I said, the 80%+ accurate point in the papers seemed to be at a 5% sample. However, since those rows come from the same pages, the cost of looking at more rows is quite small, compared to the cost of looking at 64 times as many disk pages. My ACM subscription has lapsed, though; someone with a current ACM subscription could search for this; there are several published papers, with math and pseudocode. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers