Thanks guys! I really really appreciate your feedback. I didn't know a "simple" problem like People name matching would be this complicated. I knew there will be some unusual circumstances or rules, but I did not realize how much work has been done to solve parts of the problem (string matching in general). I'll take a good look at what you've suggested and I'll let you know what approach I took if I get this thing completed.
-los Grant Ingersoll-5 wrote: > > I agree, SecondString was helpful to me. Also have a look at William > Winkler's work at the US Census. We did similar things to come up > with blocking criteria to get an initial division into duplicates, > unique and undecided. Then we refined on the undecided set. No > approach is going to be perfect and you have to make decisions about > time spent versus quality. > > > On Apr 6, 2007, at 4:26 AM, eks dev wrote: > >> I've been doing this in past couple of years, and yes we use Lucene >> for some key parts of the problem. >> Basically, the problem you face is on how to run extremely high >> recall without compromising precision, hard! >> >> the key problem is performance, imagine you have DB with 10Mio >> persons you need to match against 10Mio from another list. Where >> you start is 10E6 * 10E6 comparisons, e.g with pure Edit Distance, >> it would need a couple of centuries to finish. What you need to do >> is to define clever "blocking criteria" in order to reduce this O >> (n^2) complexity curse. Lucene comes in handy for this. >> >> Another problem is fuzzy similarity in this game, you need somehow >> to create kind of "index" for Edit distance, have a look at >> Lingpipe spell checker. Also, I guess you need to support >> synonyms like William/Bill (no fuzzy) and other semantics >> constraints not modelled by Edit Distance likes. >> >> web: >> - google for "Record Linkage" >> - look at Cohen's Secondstring project >> - http://datamining.anu.edu.au/projects/linkage.html - they have >> very nice Python prototype >> >> search for "Fellegi- Sunter" articles as these are classics.... >> >> it is only hard to do it, but doable, we are doing it on c.a 200Mio >> lists. >> >> Unfortunately, my company does not give back to the community as >> I would like... >> anyhow, I hope this can help you >> >>>> >>>> I was wondering if anyone has done people name matching using >>>> Lucene. For >>>> example, I have a name coming from some external source that I >>>> would like to >>>> match with the one I have in my DB. Lets say my DB contains the >>>> name "John >>>> Smith". If the external source has something like "Smith John", >>>> "Smith, >>>> John", "J. Smith", etc., I would like to rate this matching based >>>> on some % >>>> of closeness for review later. I've searched around a bit for >>>> algorithms >>>> and I kept seeing the Levenshtein distance algorithm which I'm sure >>>> Lucene >>>> uses under the hood. So I trying to guage if Lucene is useful for >>>> doing >>>> something specific as this, or are there better algorithms and/or >>>> software >>>> out there that does name matching. Thanks in advance! >>>> >>>> -los >>>> -- >>>> View this message in context: http://www.nabble.com/Lucene-for-name- >>>> matching-tf3533454.html#a9862342 >>>> Sent from the Lucene - Java Users mailing list archive at >>>> Nabble.com. >>>> >>>> >>>> -------------------------------------------------------------------- >>>> - >>>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>>> For additional commands, e-mail: [EMAIL PROTECTED] >>>> >>> >>> -------------------------- >>> Grant Ingersoll >>> Center for Natural Language Processing >>> http://www.cnlp.org >>> >>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ >>> LuceneFAQ >>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>> For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> >>> >> >> -- >> View this message in context: http://www.nabble.com/Lucene-for-name- >> matching-tf3533454.html#a9863587 >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> >> >> >> >> >> ___________________________________________________________ >> Now you can scan emails quickly with a reading pane. Get the new >> Yahoo! Mail. http://uk.docs.yahoo.com/nowyoucan.html >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> > > ------------------------------------------------------ > Grant Ingersoll > http://www.grantingersoll.com/ > http://lucene.grantingersoll.com > http://www.paperoftheweek.com/ > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > -- View this message in context: http://www.nabble.com/Lucene-for-name-matching-tf3533454.html#a9872800 Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]