Re: [HACKERS] proposal : cross-column stats

Florian Pflug Sat, 18 Dec 2010 07:59:36 -0800

On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:

>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>> Well, not really - I haven't done any experiments with it. For two
>>> columns selectivity equation is
>>> 
>>>     (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>> 
>>> where A and B are columns, dist(X) is number of distinct values in
>>> column X and sel(X) is selectivity of column X.
>> 
>> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
>> if A and B are independent, the formula must reduce to sel(A) * sel(B),
>> and I cannot see how that'd work with the formula above.
> 
> Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
> conditional probability, as
> 
>   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)
> 
> and "uniform correlation" assumption so that it's possible to replace the
> conditional probabilities with constants. And those constants are then
> estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).


Ok, I think I understand this now. The selectivity equation actually
*does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
for sel(A).

Take the clause "A = x AND B = y" for example. Without knowing anything
about x and y, reasonable guesses for sel(A=x) and sel(B=y) are

  sel(A=x) = 1 / dist(A)
  sel(B=y) = 1 / dist(B).

This is also what we do currently, according to var_eq_non_const() in
src/backend/utils/adt/selfuncs.c, if we don't have any additional
knowledge about x (Actually, we also factor the probability of A being
NULL into this).

With these estimates, your formula becomes

  sel(A=x,B=y) = 1 / dist(A,B).

and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus

  sel(A=x,B=y) = sel(A=x) * sel(B=y).

If, however, y is a constant, then we use the MKVs to estimate sel(B=y)
(var_eq_const() in src/backend/utils/adt/selfuncs.c). If

  sel(B=y) ~= 0,

we'd currently also conclude that

  sel(A=x,B=y) ~= 0.

With the "uniform correlation" approach, we'd instead estimate

  sel(A=x,B=y) ~= sel(A=x) / dist(B)

assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
If dist(B) is small, this estimate is much worse than what we'd currently
get, since we've effectively ignored the information that the restriction
B=y alone guarantees that only very few rows will match.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] proposal : cross-column stats

Reply via email to