On Wed, Mar 15, 2006 at 09:05:41AM +1100, Ron Savage wrote:

> On Tue, 14 Mar 2006 11:13:33 -0500, Stephen Woodbridge wrote:
>
> Hi Stephen
>
> > I have been looking into "Probabilistic Record Linkage and
> > Deduplication", you might want to throw that into google without
>
> Throw Probabilistic into CPAN and you'll find a few modules. Don't know if
> they'll help.

This type of work, finding duplicates and merging them, was exactly the
reason I started work on Gedcom.pm. Eventually, last year, I actually did
some work on it. No, of course I didn't finish it, but at least there is
something to build on if anyone wants to take a look at things while I am
waiting for my somewhat overdue shipment of appropriately shaped tuits.

The relevant files are Gedcom::Comparison and gedcom_compare. This tries to
provide a probability of two records matching based on the probabilities of
each record's sub-records matching.

From just looking at it right now, two obvious improvements that are
required are to compare actual dates rather than their textual
representations, and to match other textual records at a somewhat finer
gradation than 0% or 100%.

-- 
Paul Johnson - [EMAIL PROTECTED]
http://www.pjcj.net
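
To illustrate the scoring idea described above, here is a minimal Python sketch. It is not the code in Gedcom::Comparison (which is Perl), and every name in it is hypothetical: `record_similarity`, `date_similarity`, `text_similarity`, the `weights` mapping and the one-year date tolerance are all assumptions made for the example. It combines per-field scores into one record score using a weighted average, compares parsed dates instead of date strings, and grades text matches between 0 and 1 rather than treating them as all or nothing.

```python
from datetime import date
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Graded similarity in [0, 1] rather than an all-or-nothing 0% / 100%."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def date_similarity(a: date, b: date, tolerance_days: int = 365) -> float:
    """Compare actual dates: 1.0 when equal, falling linearly to 0.0
    once the dates are tolerance_days or more apart (assumed cut-off)."""
    gap = abs((a - b).days)
    return max(0.0, 1.0 - gap / tolerance_days)


def record_similarity(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of sub-record scores over fields present in both records.

    This is a heuristic score, not a calibrated probability.
    """
    total = 0.0
    weight_sum = 0.0
    for field, weight in weights.items():
        if field not in rec_a or field not in rec_b:
            continue  # a missing sub-record neither helps nor hurts
        va, vb = rec_a[field], rec_b[field]
        if isinstance(va, date) and isinstance(vb, date):
            score = date_similarity(va, vb)
        else:
            score = text_similarity(str(va), str(vb))
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


if __name__ == "__main__":
    weights = {"name": 3.0, "birth": 2.0, "place": 1.0}  # assumed weights
    a = {"name": "John Smith", "birth": date(1850, 1, 1), "place": "Leeds"}
    b = {"name": "Jon Smith", "birth": date(1850, 3, 1), "place": "Leeds, Yorkshire"}
    print(f"{record_similarity(a, b, weights):.2f}")
```

A weighted average is only one way to combine the sub-record scores; a proper probabilistic record linkage model would estimate the weights from data rather than hard-coding them.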