Re: Lucene for name matching

Grant Ingersoll Fri, 06 Apr 2007 05:15:06 -0700

I agree, SecondString was helpful to me. Also have a look at William Winkler's work at the US Census. We did similar things to come up with blocking criteria to get an initial division into duplicates, unique and undecided. Then we refined on the undecided set. No approach is going to be perfect and you have to make decisions about time spent versus quality.
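
A minimal sketch of that three-way split, assuming every candidate pair already carries a similarity score between 0 and 1. The function name `classify_pair` and the two thresholds are illustrative assumptions, not values from any real system:

```python
def classify_pair(score, upper=0.9, lower=0.6):
    """Put a scored record pair into one of three sets.

    upper and lower are hypothetical thresholds; in practice they would be
    tuned against the time-versus-quality trade-off.
    """
    if score >= upper:
        return "duplicate"
    if score <= lower:
        return "unique"
    # Everything between the thresholds is left for a second, closer look.
    return "undecided"


# Made-up scores for illustration only.
scored_pairs = {
    ("John Smith", "Smith, John"): 0.95,
    ("John Smith", "J. Smith"): 0.75,
    ("John Smith", "Jane Doe"): 0.10,
}
undecided = [pair for pair, s in scored_pairs.items()
             if classify_pair(s) == "undecided"]
```

Only the pairs left in `undecided` would go on to the more expensive refinement step.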


On Apr 6, 2007, at 4:26 AM, eks dev wrote:

I've been doing this for the past couple of years, and yes, we use Lucene for some key parts of the problem. Basically, the problem you face is how to get extremely high recall without compromising precision, which is hard!
The key problem is performance. Imagine you have a DB with 10 million persons that you need to match against 10 million from another list. Where you start is 10E6 * 10E6 comparisons; with pure Edit Distance, for example, it would need a couple of centuries to finish. What you need to do is define clever "blocking criteria" in order to reduce this O(n^2) complexity curse. Lucene comes in handy for this.
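
A rough sketch of the blocking idea in Python. A plain dictionary stands in for the search index here, and the blocking key (the sorted initials of the name tokens) is a hypothetical choice made only for illustration; real blocking criteria need much more care:

```python
import re
from collections import defaultdict


def blocking_key(name):
    """Hypothetical blocking key: sorted first letters of the name tokens."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return "".join(sorted(t[0] for t in tokens))


def candidate_pairs(list_a, list_b):
    """Yield only the pairs that share a blocking key, not the full cross product."""
    blocks = defaultdict(list)
    for b in list_b:
        blocks[blocking_key(b)].append(b)
    for a in list_a:
        for b in blocks.get(blocking_key(a), []):
            yield a, b


list_a = ["John Smith", "Mary Jones"]
list_b = ["Smith, John", "J. Smith", "Peter Brown", "Jones Mary"]
pairs = list(candidate_pairs(list_a, list_b))
```

The expensive comparison then runs only on `pairs` rather than on all `len(list_a) * len(list_b)` combinations. The price is that any true match whose two records fall into different blocks is never compared, so the key has to be loose enough to protect recall.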
Another problem is fuzzy similarity. In this game you need to somehow create a kind of "index" for Edit Distance; have a look at the LingPipe spell checker. Also, I guess you need to support synonyms like William/Bill (not fuzzy) and other semantic constraints not modelled by Edit Distance and the like.
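
One common way to get such an "index" is to look candidates up by shared character n-grams and only then verify them with a real edit distance. The sketch below illustrates that general idea and is not a description of how LingPipe works internally; the nickname table, the function names and the `max_edits` default are all assumptions:

```python
from collections import defaultdict

# Hypothetical synonym table; a real one would be far larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}


def canonical(name):
    """Lower-case a first name and map known nicknames to one canonical form."""
    n = name.strip().lower()
    return NICKNAMES.get(n, n)


def levenshtein(a, b):
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ngrams(s, n=2):
    """Character n-grams of s, padded with start and end markers."""
    padded = "^" + s.lower() + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(names):
    """Inverted index from n-gram to the names that contain it."""
    index = defaultdict(set)
    for name in names:
        for g in ngrams(name):
            index[g].add(name)
    return index


def lookup(index, query, max_edits=1):
    """Collect names sharing any n-gram with the query, then verify by edit distance."""
    q = canonical(query)
    candidates = set()
    for g in ngrams(q):
        candidates |= index.get(g, set())
    return sorted(c for c in candidates if levenshtein(q, c.lower()) <= max_edits)


index = build_index(["william", "john", "jon", "robert"])
```

Here `lookup(index, "Bill")` resolves through the synonym table rather than through edit distance, while a misspelling such as "Jonn" is handled by the n-gram lookup followed by verification.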
On the web:
- google for "Record Linkage"
- look at Cohen's SecondString project
- http://datamining.anu.edu.au/projects/linkage.html - they have a very nice Python prototype
- search for "Fellegi-Sunter" articles, as these are classics
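
For orientation, the core of the Fellegi-Sunter approach is a per-field match weight: log2(m/u) when a field agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability that the field agrees for true matches and u the probability that it agrees for non-matches. A small sketch follows; the field names and the m/u values are invented for illustration and would normally be estimated from data:

```python
import math

# Hypothetical (m, u) probabilities per field.
FIELDS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_year": (0.85, 0.02),
}


def match_weight(agreements):
    """Sum the agreement or disagreement weights over all fields.

    agreements maps each field name to True (the pair agrees) or False.
    """
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agreements[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


all_agree = match_weight({"last_name": True, "first_name": True, "birth_year": True})
all_differ = match_weight({"last_name": False, "first_name": False, "birth_year": False})
```

The total weight is then compared against an upper and a lower threshold, which yields links, non-links and a middle band of possible links for review.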
It is just hard to do, but doable; we are doing it on lists of ca. 200 million.
Unfortunately, my company does not give back to the community as I would like...
Anyhow, I hope this helps you.
I was wondering if anyone has done people name matching using Lucene. For example, I have a name coming from some external source that I would like to match with the one I have in my DB. Let's say my DB contains the name "John Smith". If the external source has something like "Smith John", "Smith, John", "J. Smith", etc., I would like to rate this matching based on some % of closeness for review later. I've searched around a bit for algorithms and I kept seeing the Levenshtein distance algorithm, which I'm sure Lucene uses under the hood. So I am trying to gauge if Lucene is useful for doing something as specific as this, or whether there are better algorithms and/or software out there that do name matching. Thanks in advance!
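
To make the three example variants concrete, here is a small order-insensitive scoring sketch. It is only one possible approach: the function names, the half credit for a matching initial and the ASCII-only tokenizer are all assumptions made for illustration:

```python
import re


def name_tokens(name):
    """Lower-case a name and split it into alphabetic tokens (ASCII only)."""
    return re.findall(r"[a-z]+", name.lower())


def token_matches(a, b):
    """Tokens match if equal, or if one is a single-letter initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def closeness(db_name, external_name):
    """Percentage of DB-name tokens matched by a distinct external token.

    Token order is ignored; a match on an initial only earns half credit
    (a hypothetical weight).
    """
    db_tokens = name_tokens(db_name)
    remaining = name_tokens(external_name)
    if not db_tokens:
        return 0.0
    score = 0.0
    for tok in db_tokens:
        for cand in remaining:
            if token_matches(tok, cand):
                score += 1.0 if tok == cand else 0.5
                remaining.remove(cand)  # each external token is used once
                break
    return 100.0 * score / len(db_tokens)


variants = ["Smith John", "Smith, John", "J. Smith"]
scores = {v: closeness("John Smith", v) for v in variants}
```

With these assumptions the two reordered forms score as full matches, and "J. Smith" scores lower because the first name is matched only by its initial.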

-los
--
View this message in context: http://www.nabble.com/Lucene-for-name-matching-tf3533454.html#a9862342
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


