On Wed, Mar 15, 2006 at 09:05:41AM +1100, Ron Savage wrote:
> On Tue, 14 Mar 2006 11:13:33 -0500, Stephen Woodbridge wrote:
> 
> Hi Stephen
> 
> > I have been looking into "Probabilistic Record Linkage and
> > Deduplication", you might want to throw that into google without
> 
> Throw Probabilistic into CPAN and you'll find a few modules. Don't know if
> they'll help.

This type of work, finding duplicates and merging them, was exactly the
reason I started work on Gedcom.pm.  Eventually, last year, I actually
did some work on it.

No, of course I didn't finish it, but at least there is something to
build on if anyone wants to take a look at things while I am waiting for
my somewhat overdue shipment of appropriately shaped tuits.

The relevant files are Gedcom::Comparison and gedcom_compare.  They
try to estimate the probability that two records match, based on the
probabilities of each record's sub-records matching.
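
To illustrate the general idea (this is not the Gedcom::Comparison code,
which is Perl), here is a minimal Python sketch that combines per-field
scores into one record-level score. The function name, the use of a
weighted mean, the field weights and the dict layout are all assumptions
made for the example.

```python
def record_match_score(rec_a, rec_b, weights):
    """Weighted mean of per-field match scores, in the range 0.0 to 1.0.

    rec_a and rec_b are plain dicts of tag -> text (a hypothetical layout).
    weights maps each tag to its relative importance (assumed values).
    """
    total = 0.0
    weight_sum = 0.0
    for field, weight in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a field missing on either side counts as no evidence
        # Crude all-or-nothing comparison of the textual values.
        score = 1.0 if a.strip().lower() == b.strip().lower() else 0.0
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


if __name__ == "__main__":
    a = {"NAME": "John /Smith/", "BIRT": "1 JAN 1900", "PLAC": "London"}
    b = {"NAME": "john /smith/", "BIRT": "1 JAN 1900", "PLAC": "Paris"}
    print(record_match_score(a, b, {"NAME": 3, "BIRT": 2, "PLAC": 1}))
```

A weighted mean is only one way to combine sub-record scores; a proper
probabilistic linkage model would weight agreement and disagreement
separately for each field.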

From just looking at it right now, two obvious improvements are needed:
comparing actual dates rather than their textual representations, and
matching other textual records on a somewhat finer gradation than 0% or
100%.
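
Both improvements can be sketched in a few lines of Python. This is an
illustration only, under stated assumptions: the helper names are
hypothetical, only plain "D MON YYYY" dates are parsed (no ABT, BEF,
ranges or partial dates), the one-year tolerance is arbitrary, and
difflib's similarity ratio stands in for whatever text metric would
actually suit.

```python
from datetime import date
from difflib import SequenceMatcher

# English month abbreviations, as used in simple GEDCOM-style dates.
MONTHS = {m: i for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def text_score(a, b):
    """Graded similarity between 0.0 and 1.0 instead of all-or-nothing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def parse_simple_date(text):
    """Parse 'D MON YYYY' only; return None for anything else."""
    parts = text.upper().split()
    if len(parts) != 3 or parts[1] not in MONTHS:
        return None
    try:
        return date(int(parts[2]), MONTHS[parts[1]], int(parts[0]))
    except ValueError:
        return None


def date_score(a, b, tolerance_days=365):
    """Compare dates as dates; fall back to text if either fails to parse."""
    da, db = parse_simple_date(a), parse_simple_date(b)
    if da is None or db is None:
        return text_score(a, b)
    gap = abs((da - db).days)
    # Score falls linearly from 1.0 (same day) to 0.0 at the tolerance.
    return max(0.0, 1.0 - gap / tolerance_days)
```

With this, "1 JAN 1900" and "01 jan 1900" compare as the same date even
though the strings differ, and "Smith" against "Smyth" gets a score
strictly between 0 and 1.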

-- 
Paul Johnson - [EMAIL PROTECTED]
http://www.pjcj.net
