Re: [HACKERS] proposal : cross-column stats

tv Tue, 21 Dec 2010 08:31:01 -0800

> On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
>>>> This is the reason why they choose to always combine the values (with
>>>> varying weights).
>>>
>>> There are no varying weights involved there. What they do is to express
>>> P(A=x,B=y) once as
>>>
>>> ...
>>>
>>>  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>>>              = dist(A)*P(A=x)/(2*dist(A,B)) +
>>> dist(B)*P(B=x)/(2*dist(A,B))
>>>              = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>>>
>>> That averaging steps add *no* further data-dependent weights.
>>
>> Sorry, by 'varying weights' I didn't mean that the weights are different
>> for each value of A or B. What I meant is that they combine the values
>> with different weights (just as you explained).
>
> I'm still not sure we're on the same page here. The resulting formula
> is *not* a weighted average of P(A=x) and P(B=y), since in general
> dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
> syntactically, but that's about it.


OK, another crazy usage or 'weights' on my side :-(

What I meant is that in the end you have two equations of P(A,B):

  P(A=x|B=y)*P(B=y) = dist(B)*P(B=y)/dist(A,B)
  P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and you need to combine those two estimates. They did that by averaging,
as they don't know which of the estimates is better.

Generally I think that is a good solution, unless you know one of the
estimates is much more reliable (although I'm not sure we should
completely omit the other estimate).

> The resulting formula instead is an *unweighted* (weights 1) average of
> the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
> as well estimate P(A=x,B=y) with
>
>   P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)
>
> and it's *still* be the very same uniform bayesian approach, just no
> longer symmetric in A and B. Which may easily be preferable if you
> have reasons to believe that this estimate is more correct than the
> one obtained by swapping A and B. The original paper doesn't deal with
> that case simply because they don't mention how P(A=x) and P(B=y)
> are obtained at all. The postgres estimator, on the other hand,
> knows quite well how it derived P(A=x) and P(B=y) and may have much
> higher confidence in one value than in the other.

OK, good point. I haven't realized that one of the estimates may be much
more reliable.

But let's assume both estimates are about the same (regarding reliability)
and let's see the following example

 A | B
 =====
 1 | 1
 1 | 1
 1 | 1
 1 | 2
 2 | 1
 2 | 2
 2 | 2
 2 | 2

Thus dist(A)=dist(B)=2, dist(A,B)=4 and

  P(A=1)=P(A=2)=P(B=1)=P(B=2)=1/2
  P(A=1,B=1)=P(A=2,B=2)=3/8
  P(A=1,B=2)=P(A=1,B=1)=1/8

According to the formula presented in the paper, the partial estimates for
P(A=1,B=2) are

  P(A=1|B=2)*P(B=2) = dist(A)/dist(A,B) * P(B=2) = 2/4 * 1/2 = 1/4
  P(B=2|A=1)*P(A=1) = dist(B)/dist(A,B) * P(A=1) = 2/4 * 1/2 = 1/4

Thus P(A=1,B=2) = (1/4 + 1/4)/2 = 1/4, so it's overestimated (2x)

 A | B
 =====
 1 | 1
 1 | 2
 1 | 2
 1 | 2
 2 | 1
 2 | 1
 2 | 1
 2 | 2

This obviously has exactly the same features (regarding number of distinct
values), and the produced estimate is exactly the same. But in this case

  P(A=1,B=2)=P(A=2,B=1)=3/8
  P(A=1,B=1)=P(A=2,B=2)=1/8

and thus the 1/4 is an underestimate (compared to 3/8).

The problem is the F(A,B) does not change at all. It's very simple to
construct examples (just use more rows) where F(A,B) returns exactly the
same value, but the estimates are off. The averaging somehow (smooths)
this of ...

But I think I'm missing something about how to use the F(?,?) to derive
the final estimate. So maybe the resulting estimate would be better.

Say there are two tables

   A | B | number of such rows
  ==============================
   1 | 1 | 1000
   1 | 2 | 1000
   2 | 1 | 1000
   1 | 2 | 1000

   A | B | number of such rows
  ==============================
   1 | 1 | 1
   1 | 2 | 1999
   2 | 1 | 1999
   1 | 2 | 1

How would you estimate the P(A=1,B=1) in those cases? Assume that both
estimates are equally reliable - i.e. deduced from a histogram or MCV.

>
> Assume for example that you're preparing the statement
>
>   SELECT * FROM T WHERE A = ? AND B = 1
>
> We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
> without an actual value for the parameter "?". The estimate for P(B=1),
> on the other hand, can use the histogram, and will thus very likely be
> much more precise. The two estimates for P(A=?,B=1) in this case are
>
>   P(A=?,B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
>   P(B=1,A=?)*P(A=1) = dist(A)*P(A=?)/dist(A,B).
>
> There's a good chance that the former beats the latter, and thus also
> the average, then.

OK, good point. I was not thinking about prepared statements. In this case
it makes sense to use only one of the estimates ...

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] proposal : cross-column stats

Reply via email to