I have been looking into "Probabilistic Record Linkage and Deduplication"; you might want to throw that into Google without the quotes. These are methods that allow you to set up weights for whether fields exist or don't exist and then compare records. You get an overall score for each record comparison, from 0.0 to 1.0, or the percentage likelihood of a match. These methods are used in databases to match records from various sources that may have missing, incomplete, or misspelled data, or other "noise". Some of these techniques use HMMs (Hidden Markov Models).
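
To make the weighting idea concrete, here is a minimal Python sketch. It is not a full probabilistic linkage model: the field names, the weights, and the choice to skip fields missing from either record are all assumptions made for illustration, and real systems typically estimate their weights from data.

```python
from difflib import SequenceMatcher

# Hypothetical field weights, chosen by hand for illustration only.
WEIGHTS = {"surname": 0.4, "given": 0.3, "birth_year": 0.2, "birth_place": 0.1}

def field_similarity(a, b):
    """Return a 0.0..1.0 similarity for two field values, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of field similarities over fields present in both records."""
    total = 0.0
    used = 0.0
    for field, weight in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue  # assumption: a missing field counts neither for nor against
        total += weight * field_similarity(a, b)
        used += weight
    return total / used if used else 0.0
```

Calling `match_score` on two dicts of strings gives a value between 0.0 and 1.0; a misspelled surname lowers the score without ruling out the match.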

So I have been looking into these for Geocoding and Address Standardization, but they could also be applied to GEDCOM record (individual) matching. This is not trivial stuff to understand, but from what I have read it could be a VERY powerful tool for doing this kind of matching in the genealogy world. I have too many projects going at the moment to work on this, but I would be happy to point anyone who is interested in the right direction and bounce ideas around the list on it.

Does anyone else know anything about these ideas?

-Steve W.

Hugh S. Myers wrote:
I think the idea of root based tree comparison is a good one and a diff
style output likewise excellent. That said I would also implement a 'fuzzy'
comparison with particular attention paid to name similarity. Perhaps
rolling in some sort of rated value based on the sound of the names as well
(the name of this kind of comparison slipped my mind as soon as I wrote this
down...)
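
One possible reading of a "fuzzy" name comparison with a sound-based rating is sketched below in Python. It uses American Soundex, which the original post further down mentions, as the sound code; that choice, the `name_similarity` helper, and the 0.3 weight given to the sound match are assumptions for illustration, not something the thread settles on.

```python
from difflib import SequenceMatcher

# Standard American Soundex consonant codes.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character American Soundex code, or "" for an empty name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (out + "000")[:4]

def name_similarity(a, b, sound_weight=0.3):
    """Blend spelling similarity with a bonus when the Soundex codes agree."""
    spelling = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    sounds_alike = 1.0 if soundex(a) and soundex(a) == soundex(b) else 0.0
    return (1 - sound_weight) * spelling + sound_weight * sounds_alike
```

With this blend, spelling variants such as "Myers" and "Meyers" rate higher than spelling similarity alone would give, while unrelated names stay low.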

--hsm


-----Original Message-----
From: Hans Fugal [mailto:[EMAIL PROTECTED]
Sent: Monday, March 13, 2006 6:53 AM
To: perl-gedcom@perl.org
Subject: Re: gedcom diff

Sorry for the quite late reply - you can see how often I wade through
secondary lists when I get busy. :)

I've thought about a more literal gedcom diff - you give it two gedcoms
and a root pair, and it spits out exactly what the differences are in
some easy-to-digest way. I was thinking regular diff-like output,
actually, which could be interpreted by gedcom-savvy humans or by a
program with a UI. The hard part was deciding which sort order to put
the records in because that greatly influences how the diff looks.
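
As a rough Python sketch of the sort-order point: the code below splits each file into records at level-0 lines, sorts the records by their first NAME line, and feeds the result to a standard unified diff. Sorting by name is only one assumed choice of order. The sketch does not normalize xref ids such as @I1@, so records numbered differently in the two files would still show up as changes.

```python
import difflib

def split_records(gedcom_text):
    """Split GEDCOM text into records, each starting at a level-0 line."""
    records, current = [], []
    for raw in gedcom_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("0 ") and current:
            records.append(current)
            current = []
        current.append(line)
    if current:
        records.append(current)
    return records

def sort_key(record):
    """Sort by the first NAME line; fall back to the record's first line."""
    for line in record:
        if line.startswith("1 NAME "):
            return line
    return record[0]

def canonical_lines(gedcom_text):
    """Flatten the records into one list of lines in a stable order."""
    lines = []
    for record in sorted(split_records(gedcom_text), key=sort_key):
        lines.extend(record)
    return lines

def gedcom_diff(text_a, text_b):
    """Return unified-diff lines between two GEDCOM texts after sorting."""
    return list(difflib.unified_diff(canonical_lines(text_a),
                                     canonical_lines(text_b),
                                     "a.ged", "b.ged", lineterm=""))
```

Because both files are put into the same order first, records that merely appear in a different position produce no diff output.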

I remember thinking along the same lines as you about comparing trees,
and I think that's the right approach, especially given a known root
pair. I think you would still want to compare names or some other
identifying characteristics in addition to just tree shape, which would
prevent merging multiple parents/spouses, not to mention children!
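
A minimal Python sketch of the root-pair idea follows, restricted to walking parents, since multiple spouses and children are the hard part. The dict layout (`name`, `father`, `mother` keyed by record id) is a hypothetical stand-in for parsed GEDCOM data and not any library's API, and the exact name check could be swapped for a fuzzier one.

```python
def diff_ancestors(tree_a, tree_b, root_a, root_b, path="root"):
    """Walk both pedigrees in parallel from a known matching pair.

    Yields (path, kind, detail) tuples for name mismatches and for
    parents present in one tree but not the other. Assumes the parent
    links contain no cycles.
    """
    person_a, person_b = tree_a[root_a], tree_b[root_b]
    if person_a["name"].lower() != person_b["name"].lower():
        yield (path, "name differs", (person_a["name"], person_b["name"]))
    for rel in ("father", "mother"):
        id_a, id_b = person_a.get(rel), person_b.get(rel)
        sub_path = f"{path}.{rel}"
        if id_a and id_b:
            # Both trees have this parent: treat them as a pair and recurse.
            yield from diff_ancestors(tree_a, tree_b, id_a, id_b, sub_path)
        elif id_a:
            yield (sub_path, "missing in B", tree_a[id_a]["name"])
        elif id_b:
            yield (sub_path, "missing in A", tree_b[id_b]["name"])
```

The path string ("root.father.mother" and so on) identifies each difference by position in the tree, so the record ids in the two files never need to agree.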

On Tue, 16 Nov 2004 at 06:49 -0700, [EMAIL PROTECTED] wrote:

Has anyone come up with code that can diff two .ged files? I have two
large files with data missing from each (two sides of the family). All the
tools I have looked at seem to use soundex to find matches and my test
merge has not gone well (I just used the paf5 match/merge).

I was thinking if I wrote code where I would give it a reference person
from each file "these two are the same", the code could build a tree from
there and then compare the trees not the names themselves to tell me which
nodes were missing. I am still not sure how I would handle multiple
spouses, and multiple parents and such.... but I have been mulling it
around for a while.

Thought I would see if anyone had built anything along these lines.

--
Chuck


--
Hans Fugal ; http://hans.fugal.net

There's nothing remarkable about it. All one has to do is hit the
right keys at the right time and the instrument plays itself.
   -- Johann Sebastian Bach



