On 17.12.2010 19:58, Robert Haas wrote:
> I haven't read the paper yet (sorry) but just off the top of my head,
> one possible problem here is that our n_distinct estimates aren't
> always very accurate, especially for large tables. As we've discussed
> before, making them accurate requires sampling a significant
> percentage of the table, whereas all of our other statistics can be
> computed reasonably accurately by sampling a fixed amount of an
> arbitrarily large table. So it's possible that relying more heavily
> on n_distinct could turn out worse overall even if the algorithm is
> better. Not sure if that's an issue here, just throwing it out
> there...
Yes, you're right - the paper really is based on (estimates of) the
number of distinct values, both for each of the columns separately and
for the group of columns as a whole. AFAIK it will work with reasonably
precise estimates, but the point is that you need an estimate for the
whole group of columns. So when you want an estimate for queries on
columns (a,b), you need the number of distinct combinations of values
in those two columns.

And I don't think we're collecting that right now, so this solution
requires scanning the table (or at least some part of it). I know this
is a weak point of the whole solution, but the truth is that every
cross-column stats solution will have to do something like this. I
don't think we'll find a solution with zero performance impact, i.e.
one that does not need to scan a sufficient part of the table. That's
why I want to make this optional, so that users enable it only when
really needed.

Anyway, one possible solution might be to allow the user to set these
values manually (just as you can when the ndistinct estimate for a
single column is not precise).

regards
Tomas
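PS: To make the "number of distinct values for the group of columns"
bit concrete - for columns (a,b) it's exactly what this query counts
(the table name is just an example), and the scan it needs is exactly
the expensive part we'd have to avoid or approximate from a sample:

    -- exact ndistinct for the column group (a,b); computing it
    -- needs a full scan, which is what makes it expensive
    SELECT count(*) FROM (SELECT DISTINCT a, b FROM t) AS g;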
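And regarding the manual override - IIRC since 9.0 you can already do
this for a single column, so a similar knob for groups of columns seems
like a natural extension:

    -- existing per-column override of the ndistinct estimate
    ALTER TABLE t ALTER COLUMN a SET (n_distinct = 1000);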