I've been doing this for the past couple of years, and yes, we use Lucene for some key parts of the problem. Basically, the problem you face is how to achieve extremely high recall without compromising precision, which is hard!
The key problem is performance. Imagine you have a DB with 10 million persons that you need to match against 10 million from another list. You start at 10^7 * 10^7 = 10^14 comparisons; with pure Edit Distance, that would need a couple of centuries to finish. What you need to do is define clever "blocking criteria" to reduce this O(n^2) complexity curse, i.e. only compare records that share some cheap key. Lucene comes in handy for this.

Another problem is fuzzy similarity: you need to somehow create a kind of "index" for Edit Distance. Have a look at the LingPipe spell checker. Also, I guess you need to support synonyms like William/Bill (not fuzzy) and other semantic constraints not modelled by Edit Distance and the like.

On the web:
- google for "Record Linkage"
- look at Cohen's SecondString project
- http://datamining.anu.edu.au/projects/linkage.html - they have a very nice Python prototype
- search for "Fellegi-Sunter" articles, as these are the classics

It is hard to do, but doable; we are doing it on lists of ca. 200 million. Unfortunately, my company does not give back to the community as much as I would like... Anyhow, I hope this helps.

>> I was wondering if anyone has done people name matching using Lucene. For
>> example, I have a name coming from some external source that I would like to
>> match with the one I have in my DB. Let's say my DB contains the name "John
>> Smith". If the external source has something like "Smith John", "Smith,
>> John", "J. Smith", etc., I would like to rate this matching based on some %
>> of closeness for review later. I've searched around a bit for algorithms
>> and I kept seeing the Levenshtein distance algorithm, which I'm sure Lucene
>> uses under the hood. So I'm trying to gauge if Lucene is useful for doing
>> something as specific as this, or whether there are better algorithms and/or
>> software out there that do name matching. Thanks in advance!
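As a rough illustration of the blocking idea from the reply above, here is a minimal Python sketch. It is not Lucene and not anyone's production approach: the nickname table, the block key (sorted token initials) and the 0.6 threshold are all assumptions made up for this example. It groups names by a cheap key, runs plain Levenshtein only inside each group, and reports a closeness score between 0 and 1.

```python
from collections import defaultdict

# Hypothetical nickname table (assumption); real synonym lists are much larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}


def normalize(name):
    """Lowercase, strip commas and periods, map nicknames, sort the tokens."""
    tokens = name.lower().replace(",", " ").replace(".", " ").split()
    return sorted(NICKNAMES.get(t, t) for t in tokens)


def block_key(tokens):
    """Toy blocking key (assumption): the sorted initials of all tokens."""
    return "".join(sorted(t[0] for t in tokens))


def levenshtein(s, t):
    """Classic dynamic-programming edit distance, computed one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a_tokens, b_tokens):
    """Closeness in [0, 1]: 1 minus edit distance over the longer length."""
    a, b = " ".join(a_tokens), " ".join(b_tokens)
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest


def match(list_a, list_b, threshold=0.6):
    """Compare only names that share a block key; return matches and the
    number of comparisons actually performed."""
    blocks = defaultdict(list)
    for name in list_b:
        tokens = normalize(name)
        if tokens:
            blocks[block_key(tokens)].append((name, tokens))

    results, comparisons = [], 0
    for name in list_a:
        tokens = normalize(name)
        if not tokens:
            continue
        for other, other_tokens in blocks.get(block_key(tokens), []):
            comparisons += 1
            score = similarity(tokens, other_tokens)
            if score >= threshold:
                results.append((name, other, score))
    return results, comparisons


db = ["John Smith", "William Jones"]
external = ["Smith, John", "J. Smith", "Bill Jones", "Mary Adams"]
pairs, n = match(db, external)
for a, b, score in pairs:
    print(f"{a!r} ~ {b!r}: {score:.2f}")
print(n, "comparisons instead of", len(db) * len(external))
```

On this toy input it makes 3 comparisons instead of 8 and scores "J. Smith" against "John Smith" at 0.70. The catch is recall: any true match whose key differs (a misspelled first letter, say) is never compared, so a single key like this one is unlikely to be enough on real data.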
>> -los
>> --
>> View this message in context: http://www.nabble.com/Lucene-for-name-matching-tf3533454.html#a9862342
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org