Re: [HACKERS] proposal : cross-column stats

Florian Pflug Tue, 21 Dec 2010 07:25:09 -0800

On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
>>> This is the reason why they choose to always combine the values (with
>>> varying weights).
>> 
>> There are no varying weights involved there. What they do is to express
>> P(A=x,B=y) once as
>> 
>> ...
>> 
>>  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>>              = dist(A)*P(A=x)/(2*dist(A,B)) +
>> dist(B)*P(B=x)/(2*dist(A,B))
>>              = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>> 
>> That averaging steps add *no* further data-dependent weights.
> 
> Sorry, by 'varying weights' I didn't mean that the weights are different
> for each value of A or B. What I meant is that they combine the values
> with different weights (just as you explained).


I'm still not sure we're on the same page here. The resulting formula
is *not* a weighted average of P(A=x) and P(B=y), since in general
dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
syntactically, but that's about it.

The resulting formula instead is an *unweighted* (weights 1) average of
the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
as well estimate P(A=x,B=y) with

  P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and it's *still* be the very same uniform bayesian approach, just no
longer symmetric in A and B. Which may easily be preferable if you
have reasons to believe that this estimate is more correct than the
one obtained by swapping A and B. The original paper doesn't deal with
that case simply because they don't mention how P(A=x) and P(B=y)
are obtained at all. The postgres estimator, on the other hand,
knows quite well how it derived P(A=x) and P(B=y) and may have much
higher confidence in one value than in the other.

Assume for example that you're preparing the statement

  SELECT * FROM T WHERE A = ? AND B = 1

We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
without an actual value for the parameter "?". The estimate for P(B=1),
on the other hand, can use the histogram, and will thus very likely be
much more precise. The two estimates for P(A=?,B=1) in this case are

  P(A=?,B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
  P(B=1,A=?)*P(A=1) = dist(A)*P(A=?)/dist(A,B).

There's a good chance that the former beats the latter, and thus also
the average, then.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] proposal : cross-column stats

Reply via email to