On Dec 21, 2010, at 11:37, t...@fuzzy.cz wrote:
> I doubt there is a way to make this decision with just dist(A), dist(B) and
> dist(A,B) values. Well, we could go with a rule
>
>   if [dist(A) == dist(A,B)] then [A => B]
>
> but that's very fragile. Think about estimates (we're not going to work
> with exact values of dist(?)), and then about data errors (e.g. a city
> matched to an incorrect ZIP code or something like that).

Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
kind of fragility without penalizing the non-correlated case...

> This is the reason why they choose to always combine the values (with
> varying weights).

There are no varying weights involved there. What they do is express
P(A=x,B=y) once as

  P(A=x,B=y) = P(B=y|A=x)*P(A=x)

and then as

  P(A=x,B=y) = P(A=x|B=y)*P(B=y).

Then they assume

  P(B=y|A=x) ~= dist(A)/dist(A,B) and
  P(A=x|B=y) ~= dist(B)/dist(A,B),

and go on to average the two different ways of computing P(A=x,B=y),
which finally gives

  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
              = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
              = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

That averaging step adds *no* further data-dependent weights.

>> I'd like to find a statistical explanation for that definition of
>> F(A,B), but so far I couldn't come up with any. I created a Maple 14
>> worksheet while playing around with this - if you happen to have a
>> copy of Maple available I'd be happy to send it to you..
>
> No, I don't have Maple. Have you tried Maxima
> (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/)? Sage
> even has an online notebook - that seems like a very comfortable way to
> exchange this kind of data.

I haven't tried them, but I will. That Java-based GUI of Maple is driving
me nuts anyway... Thanks for the pointers!

best regards,
Florian Pflug

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
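
For illustration, the averaged estimate derived above can be written as a
few lines of Python. This is only a sketch: the function name
joint_selectivity, its argument names, and the ZIP/city numbers are made up
for the example, and the example assumes uniformly distributed values so
that P(A=x) = 1/dist(A) and P(B=y) = 1/dist(B).

```python
def joint_selectivity(p_a, p_b, dist_a, dist_b, dist_ab):
    """Estimate P(A=x, B=y) by averaging the two factorizations.

    p_a, p_b  -- per-column selectivities P(A=x) and P(B=y)
    dist_a    -- number of distinct values of A, dist(A)
    dist_b    -- number of distinct values of B, dist(B)
    dist_ab   -- number of distinct (A, B) pairs, dist(A,B)

    All names here are hypothetical and not taken from any patch.
    """
    if dist_ab <= 0:
        raise ValueError("dist(A,B) must be positive")

    # Assumed approximations of the conditional probabilities:
    #   P(B=y|A=x) ~= dist(A)/dist(A,B)
    #   P(A=x|B=y) ~= dist(B)/dist(A,B)
    p_b_given_a = dist_a / dist_ab
    p_a_given_b = dist_b / dist_ab

    # Plain 50/50 average of the two ways of computing P(A=x, B=y).
    # The weights are fixed, not data-dependent.
    return (p_b_given_a * p_a + p_a_given_b * p_b) / 2


# Made-up example: A = ZIP code (1000 distinct), B = city (100 distinct),
# with uniformly distributed values assumed.
p_zip, p_city = 1 / 1000, 1 / 100

# ZIP determines city, so dist(A,B) == dist(A) == 1000.
dependent = joint_selectivity(p_zip, p_city, 1000, 100, 1000)

# Fully independent columns, so dist(A,B) == dist(A) * dist(B).
independent = joint_selectivity(p_zip, p_city, 1000, 100, 1000 * 100)

print(dependent, independent)
```

With uniform per-column selectivities both terms in the numerator are 1,
so the estimate reduces to about 1/dist(A,B): close to P(A=x) in the
dependent case and to P(A=x)*P(B=y) in the independent case.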