On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra <t...@fuzzy.cz> wrote: > Dne 19.1.2011 20:25, Robert Haas napsal(a): >> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <t...@fuzzy.cz> wrote: >>> Yes, I was aware of this problem (amount of memory consumed with lots of >>> updates), and I kind of hoped someone will come up with a reasonable >>> solution. >> >> As far as I can see, periodically sampling some or all of the table is >> really the only thing that has a chance of working OK. You could try >> to track changes only when they're not too big, but I think that's >> still going to have nasty performance consequences. > > Yes, sampling all the table is the only way to get reliable ndistinct > estimates.
IMHO continuing to focus on worst case results is a dead end. Yes, for some distributions, ndistinct is very hard to estimate well. However, let us not forget that the current ndistinct estimator is very bad but the postgres planner is, on the whole, very good. As far as I can see this is due to two facts: 1) The distribution of values in a table is rarely pathological, and usually follows one of several common patterns. ( IOW, we have good heuristics ) 2) The choice of plan is fairly robust to mis-estimation of ndistinct, because there are only a few things the executer can choose. It doesn't usually matter if a value composes 5% or 20% ( or 99% ) of the table, we still usually need to scan the entire table. Again, in my *very* humble opinion, If the ultimate goal is to improve the planner, we should try to cut down on the number of cases in which a poor estimate of ndistinct gives a really bad plan instead of chasing after marginal gains from a better ndistinct estimator. Maybe ( as I've advocated for before ) this means parameterizing the distribution of values ( ie, find that the values are roughly uniform ), maybe this means estimating the error of our statistics and using the most robust rather than the best plan, or maybe it means something totally different. But: ( and the literature seems to support me ) pounding away at the ndistinct estimator seems like a dead end. If you think about it, it's a bit ridiculous to look at the whole table *just* to "estimate" ndistinct - if we go that far why dont we just store the full distribution and be done with it? Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers