On 17.12.2010 19:58, Robert Haas wrote:
> I haven't read the paper yet (sorry) but just off the top of my head,
> one possible problem here is that our n_distinct estimates aren't
> always very accurate, especially for large tables. As we've discussed
> before, making them accurate requires sampling a significant
> percentage of the table, whereas all of our other statistics can be
> computed reasonably accurately by sampling a fixed amount of an
> arbitrarily large table. So it's possible that relying more heavily
> on n_distinct could turn out worse overall even if the algorithm is
> better. Not sure if that's an issue here, just throwing it out
> there...
Yes, you're right - the paper really is based on (estimates of) the
number of distinct values, both for each of the columns separately and
for the group of columns as a whole. AFAIK it will work with reasonably
precise estimates, but the point is that you need an estimate for the
whole group of columns. So when you want an estimate for queries on
columns (a,b), you need the number of distinct combinations of values
in those two columns.

And I don't think we're collecting that right now, so this solution
requires scanning the table (or at least some part of it). I know this
is a weak point of the whole solution, but the truth is that every
cross-column stats solution will have to do something like this. I
don't think we'll find a solution with zero performance impact, i.e.
one that does not need to scan a sufficient part of the table. That's
why I want to make this optional, so that users enable it only when
really needed.

Anyway, one possible solution might be to allow the user to set these
values manually (just as you can when the ndistinct estimate for a
single column is not precise).

regards
Tomas
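PS: To make the "number of distinct values for the group of columns"
bit concrete - for columns (a,b) it's exactly what this query counts
(the table name is just an example), and the scan it needs is exactly
the expensive part we'd have to avoid or approximate from a sample:

    -- exact ndistinct for the column group (a,b); computing it
    -- needs a full scan, which is what makes it expensive
    SELECT count(*) FROM (SELECT DISTINCT a, b FROM t) AS g;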
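And regarding the manual override - IIRC since 9.0 you can already do
this for a single column, so a similar knob for groups of columns seems
like a natural extension:

    -- existing per-column override of the ndistinct estimate
    ALTER TABLE t ALTER COLUMN a SET (n_distinct = 1000);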