On 17.12.2010 22:41, Heikki Linnakangas wrote:
> On 17.12.2010 23:13, Tomas Vondra wrote:
>> On 17.12.2010 19:58, Robert Haas wrote:
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always very accurate, especially for large tables.  As we've discussed
>>> before, making them accurate requires sampling a significant
>>> percentage of the table, whereas all of our other statistics can be
>>> computed reasonably accurately by sampling a fixed amount of an
>>> arbitrarily large table.  So it's possible that relying more heavily
>>> on n_distinct could turn out worse overall even if the algorithm is
>>> better.  Not sure if that's an issue here, just throwing it out
>>> there...
>>
>> Yes, you're right - the paper really is based on (estimates of) number
>> of distinct values for each of the columns as well as for the group of
>> columns.
>>
>> AFAIK it will work with reasonably precise estimates, but the point is
>> you need an estimate of distinct values of the whole group of columns.
>> So when you want to get an estimate for queries on columns (a,b), you
>> need the number of distinct value combinations of these two columns.
>>
>> And I think we're not collecting this right now, so this solution
>> requires scanning the table (or some part of it).
> 
> Any idea how sensitive it is to the accuracy of that estimate on
> distinct value combinations? If we get that off by a factor of ten or a
> hundred, what kind of an effect does it have on the final cost estimates?

Not really - I haven't done any experiments with it yet. For two
columns the selectivity equation is

      (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

where A and B are columns, dist(X) is number of distinct values in
column X and sel(X) is selectivity of column X.
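To make this concrete, here's a tiny Python sketch of the estimator
(the function and variable names are mine, not from the paper or any
patch):

```python
def multi_sel(dist_a, dist_b, dist_ab, sel_a, sel_b):
    """Two-column selectivity estimate:
    (dist(A)*sel(A) + dist(B)*sel(B)) / (2 * dist(A,B))"""
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

# e.g. 100 distinct values in each column, 1000 distinct (A,B)
# combinations, 1% selectivity on each column:
print(multi_sel(100, 100, 1000, 0.01, 0.01))  # ~0.001
```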

dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------

So the estimate is proportional to dist(A) and dist(B), and inversely
proportional to dist(A,B). So when you increase dist(A) and dist(B)
10x, you'll get 10x the original estimate. Similarly, increasing
dist(A,B) 10x gives a 10x lower estimate.
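This scaling is easy to check numerically (toy numbers, chosen
arbitrarily):

```python
def multi_sel(dist_a, dist_b, dist_ab, sel_a, sel_b):
    # (dist(A)*sel(A) + dist(B)*sel(B)) / (2 * dist(A,B))
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

base = multi_sel(100, 100, 1000, 0.01, 0.01)
# 10x the distinct values in A and B => 10x the estimate
assert abs(multi_sel(1000, 1000, 1000, 0.01, 0.01) - 10 * base) < 1e-12
# 10x the (A,B) combinations => 10x lower estimate
assert abs(multi_sel(100, 100, 10000, 0.01, 0.01) - base / 10) < 1e-12
```

So an order-of-magnitude error in any one of the three distinct-value
estimates translates directly into an order-of-magnitude error in the
selectivity.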

upper bound
-----------

As long as neither dist(A) nor dist(B) exceeds dist(A,B), the
estimated selectivity is bounded by

      (sel(A) + sel(B)) / 2

I guess we can say that (sel(A) > sel(A,B)) is generally the same as

      sel(A) = sel(A,B)

i.e. we can use the heuristic I've already offered in the PoC.
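A quick randomized check of the bound (again just a sketch with names
of my own choosing; dist(A) <= dist(A,B) and dist(B) <= dist(A,B) are
enforced by construction):

```python
import random

def multi_sel(da, db, dab, sa, sb):
    return (da * sa + db * sb) / (2.0 * dab)

random.seed(1)
for _ in range(1000):
    dab = random.randint(1, 10**6)
    da = random.randint(1, dab)   # dist(A) <= dist(A,B)
    db = random.randint(1, dab)   # dist(B) <= dist(A,B)
    sa, sb = random.random(), random.random()
    # estimate never exceeds the average of the two selectivities
    assert multi_sel(da, db, dab, sa, sb) <= (sa + sb) / 2
```

The bound follows directly: da*sa + db*sb <= dab*sa + dab*sb whenever
da <= dab and db <= dab.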

lower bound
-----------

Not sure what the lower bound is, but the estimate can get arbitrarily
close to 0 if dist(A) and dist(B) are very small and/or dist(A,B) is
huge.
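For example (toy numbers of my own; note that in reality dist(A,B)
can't exceed dist(A) * dist(B), so the estimate shrinks mainly through
low per-column selectivities):

```python
def multi_sel(da, db, dab, sa, sb):
    return (da * sa + db * sb) / (2.0 * dab)

# small dist(A)/dist(B) relative to dist(A,B), low per-column
# selectivities: the estimate approaches 0
print(multi_sel(10, 10, 100, 0.001, 0.001))  # ~1e-4
```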

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
