On 12/07/2013 11:46 AM, Robert Haas wrote: > Maybe there's some highly-principled statistical approach which could > be taken here, and if so that's fine, but I suspect not. So what I > think we should do is auto-tune the statistics target based on the > table size. If, say, we think that the generally useful range for the > statistics target is something like 10 to 400, then let's come up with > a formula based on table size that outputs 10 for small tables, 400 > for really big tables, and intermediate values for tables in the > middle.
The only approach which makes sense is to base it on a % of the table. In fact, pretty much every paper which has examined statistics estimation for database tables has determined that any estimate based on a less-than-5% sample is going to be wildly inaccurate. Not that 5% samples are 100% accurate, but at least they fit the 80/20 rule. This is the reason why implementing block-based sampling is critical; using our current "take one row out of every page" method, sampling 5% of the table means scanning the whole thing in most tables. We also need to decouple the number of MCVs we keep from the sample size. Certainly our existing sampling algo seems designed to maximize IO for the sample size. There's other qualitative improvements we could make, which Nathan Boley has spoken on. For example, our stats code has no way to recognize a normal or exponential distrbution -- it assumes that all columns are randomly distributed. If we could recoginze common distribution patterns, then not only could we have better query estimates, those would require keeping *fewer* stats, since all you need for a normal distribution are the end points and the variance. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers