On 18.12.2010 16:59, Florian Pflug wrote:
> On Dec 18, 2010, at 08:10, t...@fuzzy.cz wrote:
> 
>>> On Dec 17, 2010, at 23:12, Tomas Vondra wrote:
>>>> Well, not really - I haven't done any experiments with it. For two
>>>> columns selectivity equation is
>>>>
>>>>     (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>>>
>>>> where A and B are columns, dist(X) is number of distinct values in
>>>> column X and sel(X) is selectivity of column X.
>>>
>>> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
>>> if A and B are independent, the formula must reduce to sel(A) * sel(B),
>>> and I cannot see how that'd work with the formula above.
>>
>> Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
>> conditional probability, as
>>
>>   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)
>>
>> and "uniform correlation" assumption so that it's possible to replace the
>> conditional probabilities with constants. And those constants are then
>> estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).
> 
> Ok, I think I understand this now. The selectivity equation actually
> *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
> for sel(A).
> 
> Take the clause "A = x AND B = y" for example. Without knowing anything
> about x and y, reasonable guesses for sel(A=x) and sel(B=y) are
> 
>   sel(A=x) = 1 / dist(A)
>   sel(B=y) = 1 / dist(B).
> 
> This is also what we do currently, according to var_eq_non_const() in
> src/backend/utils/adt/selfuncs.c, if we don't have any additional
> knowledge about x (Actually, we also factor the probability of A being
> NULL into this).
> 
> With these estimates, your formula becomes
> 
>   sel(A=x,B=y) = 1 / dist(A,B).
> 
> and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus
> 
>   sel(A=x,B=y) = sel(A=x) * sel(B=y).
> 
> If, however, y is a constant, then we use the MCVs to estimate sel(B=y)
> (var_eq_const() in src/backend/utils/adt/selfuncs.c). If
> 
>   sel(B=y) ~= 0,
> 
> we'd currently also conclude that
> 
>   sel(A=x,B=y) ~= 0.
> 
> With the "uniform correlation" approach, we'd instead estimate
> 
>   sel(A=x,B=y) ~= sel(A=x) / dist(B)
> 
> assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
> If dist(B) is small, this estimate is much worse than what we'd currently
> get, since we've effectively ignored the information that the restriction
> B=y alone guarantees that only very few rows will match.

Well, I guess you're right. Let's look at two examples - uncorrelated and
highly correlated data (and then a few comments on what I think you're
missing).
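
To keep the arithmetic below easy to check, here is a minimal sketch of
both estimators (Python; the function names are mine, not from any patch):

  def current_estimate(sel_a, sel_b):
      # current planner behaviour: attribute value independence,
      # i.e. sel(A=a, B=b) = sel(A=a) * sel(B=b)
      return sel_a * sel_b

  def proposed_estimate(sel_a, sel_b, dist_a, dist_b, dist_ab):
      # "uniform correlation" estimate:
      # (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B))
      return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

  # your point above, numerically: even if the MCV list says sel(B=y) ~= 0,
  # the proposed formula still returns about sel(A=x) / (2*dist(B)):
  proposed_estimate(1/10.0, 0.0, 10, 100, 1000)   # -> 0.0005, not ~0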

uncorrelated
------------
Let there be 10 distinct values in A, 100 distinct values in B, and let
all possible combinations be present (10x100 = 1000). All the values (and
combinations) are uniformly distributed.

The current estimate is correct, i.e.

  sel(A=a) = 1/10 = 1/dist(A)
  sel(B=b) = 1/100 = 1/dist(B)
  sel(A=a, B=b) = sel(A=a) * sel(B=b) = 1/1000  /* attribute value independence */

the proposed estimate is

  sel(A=a, B=b) = (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B))
                = (10 * 1/10 + 100 * 1/100) / (2 * 1000) = 2/2000 = 1/1000

so it actually produces the same estimate, but that's due to the uniform
distribution assumption.

Let's say the selectivities for A and B are very different - the value of
A is 10x less common than average, the value of B is 10x more common
(and let's say we know this). Then the current estimate is still correct,
i.e. it gives

  sel(A=a) = 1/100
  sel(B=b) = 1/10

  => sel(A=a,B=b) = 1/1000

but the new estimate is

  (10 * 1/100 + 100 * 1/10) / (2 * 1000) = (101/10) / 2000 ~ 1/200

So yes, in the case of uncorrelated data the proposed formula overestimates
the result.
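
Plugging both uncorrelated cases into the sketch above (dist(A,B) = 1000):

  proposed_estimate(1/10.0, 1/100.0, 10, 100, 1000)   # -> 0.001   = 1/1000
  current_estimate(1/100.0, 1/10.0)                   # -> 0.001   = 1/1000 (correct)
  proposed_estimate(1/100.0, 1/10.0, 10, 100, 1000)   # -> 0.00505 ~ 1/200 (overestimate)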


highly correlated
-----------------
Again, let dist(A)=10 and dist(B)=100, but this time let B => A, i.e. for
each value of B there is exactly one value of A - say A = mod(B,10) or
something like that. So now dist(A,B) = 100.

Assuming uniform distribution, the correct estimate is

  sel(A=a,B=b) = sel(B=b) = 1/100

and the current estimate is

  sel(A=a,B=b) = sel(A=a) * sel(B=b) = 1/10 * 1/100 = 1/1000

The proposed estimate is

  (10 * 1/10 + 100 * 1/100) / 200 = 2/200 = 1/100

which is correct.

Without the uniform distribution (again, sel(A=a)=1/100, sel(B=b)=1/10),
we currently get (just as before)

  sel(A=a,B=b) = 1/10 * 1/100 = 1/1000

which is 100x less than the correct value (sel(B=b)=1/10). The new
estimate yields

  (10*1/100 + 100*1/10) / 200 = (1010/100) / 200 ~ 1/20

which is not as accurate as with uniform distribution, but it's
considerably more accurate than the current estimate (1/1000).
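
And the two correlated cases, now with dist(A,B) = 100 (B determines A):

  current_estimate(1/10.0, 1/100.0)                   # -> 0.001  = 1/1000 (100x off)
  proposed_estimate(1/10.0, 1/100.0, 10, 100, 100)    # -> 0.01   = 1/100  (correct)
  proposed_estimate(1/100.0, 1/10.0, 10, 100, 100)    # -> 0.0505 ~ 1/20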


comments
--------
I really don't think this is a perfect solution giving 100% accurate
results in all cases. But the examples above illustrate that it gives
much better estimates for correlated data.

It seems to me you're missing one very important thing - this was not
meant as a new default way to do estimates. It was meant as an option for
cases where the user (DBA, developer, ...) realizes the current solution
gives really bad estimates (due to correlation). In that case they could
create 'cross-column' statistics on those columns, and the optimizer
would use that info to do the estimates.

This more or less eliminates the issue with uncorrelated data, as that
case would not actually arise. Sure, the user might create cross-column
stats on uncorrelated columns and get incorrect estimates as a result,
but I really don't think we can solve that kind of problem.

regards
Tomas
