On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
> I doubt there is a way to this decision with just dist(A), dist(B) and
> dist(A,B) values. Well, we could go with a rule
>  if [dist(A) == dist(A,B)] the [A => B]
> but that's very fragile. Think about estimates (we're not going to work
> with exact values of dist(?)), and then about data errors (e.g. a city
> matched to an incorrect ZIP code or something like that).

Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
kind of fragility without penalizing the non-correlated case...

> This is the reason why they choose to always combine the values (with
> varying weights).

There are no varying weights involved there. What they do is to express
P(A=x,B=y) once as

  P(A=x,B=y) = P(B=y|A=x)*P(A=x) and then as
  P(A=x,B=y) = P(A=x|B=y)*P(B=y).

Then they assume

  P(B=y|A=x) ~= dist(A)/dist(A,B) and
  P(A=x|B=y) ~= dist(B)/dist(A,B),

and go on to average the two different ways of computing P(A=x,B=y), which
finally gives

  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
              = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=x)/(2*dist(A,B))
              = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

That averaging steps add *no* further data-dependent weights. 

>> I'd like to find a statistical explanation for that definition of
>> F(A,B), but so far I couldn't come up with any. I created a Maple 14
>> worksheet while playing around with this - if you happen to have a
>> copy of Maple available I'd be happy to send it to you..
> No, I don't have Maple. Have you tried Maxima
> (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/). Sage
> even has an online notebook - that seems like a very comfortable way to
> exchange this kind of data.

I haven' tried them, but I will. That java-based GUI of Maple is driving
me nuts anyway... Thanks for the pointers!

best regards,
Florian Pflug

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to