Re: [HACKERS] estimating # of distinct values

Nathan Boley Wed, 19 Jan 2011 14:45:44 -0800

On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
> Dne 19.1.2011 20:25, Robert Haas napsal(a):
>> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>>> Yes, I was aware of this problem (amount of memory consumed with lots of
>>> updates), and I kind of hoped someone will come up with a reasonable
>>> solution.
>>
>> As far as I can see, periodically sampling some or all of the table is
>> really the only thing that has a chance of working OK.  You could try
>> to track changes only when they're not too big, but I think that's
>> still going to have nasty performance consequences.
>
> Yes, sampling all the table is the only way to get reliable ndistinct
> estimates.


IMHO continuing to focus on worst case results is a dead end. Yes, for
some distributions, ndistinct is very hard to estimate well. However,
let us not forget that the current ndistinct estimator is very bad but
the postgres planner is, on the whole, very good.  As far as I can see
this is due to two facts:

1) The distribution of values in a table is rarely pathological, and
usually follows one of several common patterns. ( IOW, we have good
heuristics )

2) The choice of plan is fairly robust to mis-estimation of ndistinct,
because there are only a few things the executer can choose. It
doesn't usually matter if a value composes 5% or 20%  ( or 99% ) of
the table, we still usually need to scan the entire table.

Again, in my *very* humble opinion, If the ultimate goal is to improve
the planner, we should try to cut down on the number of cases in which
a poor estimate of ndistinct gives a really bad plan instead of
chasing after marginal gains from a better ndistinct estimator. Maybe
( as I've advocated for before ) this means parameterizing the
distribution of values ( ie, find that the values are roughly uniform
), maybe this means estimating the error of our statistics and using
the most robust rather than the best plan, or maybe it means something
totally different. But: ( and the literature seems to support me )
pounding away at the ndistinct estimator seems like a dead end. If you
think about it, it's a bit ridiculous to look at the whole table
*just* to "estimate" ndistinct - if we go that far why dont we just
store the full distribution and be done with it?

Best,
Nathan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] estimating # of distinct values

Reply via email to