I agree, SecondString was helpful to me. Also have a look at William Winkler's work at the US Census. We did similar things to come up with blocking criteria to get an initial division into duplicates, unique and undecided. Then we refined on the undecided set. No approach is going to be perfect and you have to make decisions about time spent versus quality.

On Apr 6, 2007, at 4:26 AM, eks dev wrote:

I've been doing this in past couple of years, and yes we use Lucene for some key parts of the problem. Basically, the problem you face is on how to run extremely high recall without compromising precision, hard!

the key problem is performance, imagine you have DB with 10Mio persons you need to match against 10Mio from another list. Where you start is 10E6 * 10E6 comparisons, e.g with pure Edit Distance, it would need a couple of centuries to finish. What you need to do is to define clever "blocking criteria" in order to reduce this O (n^2) complexity curse. Lucene comes in handy for this.

Another problem is fuzzy similarity in this game, you need somehow to create kind of "index" for Edit distance, have a look at Lingpipe spell checker. Also, I guess you need to support synonyms like William/Bill (no fuzzy) and other semantics constraints not modelled by Edit Distance likes.

web:
- google for "Record Linkage"
- look at Cohen's Secondstring project
- http://datamining.anu.edu.au/projects/linkage.html - they have very nice Python prototype

search for "Fellegi- Sunter" articles as these are classics....

it is only hard to do it, but doable, we are doing it on c.a 200Mio lists.

Unfortunately, my company does not give back to the community as I would like...
anyhow, I hope this can help you


I was wondering if anyone has done people name matching using
Lucene.  For
example, I have a name coming from some external source that I
would like to
match with the one I have in my DB.  Lets say my DB contains the
name "John
Smith".  If the external source has something like "Smith John",
"Smith,
John", "J. Smith", etc., I would like to rate this matching based
on some %
of closeness for review later.  I've searched around a bit for
algorithms
and I kept seeing the Levenshtein distance algorithm which I'm sure
Lucene
uses under the hood.  So I trying to guage if Lucene is useful for
doing
something specific as this, or are there better algorithms and/or
software
out there that does name matching.  Thanks in advance!

-los
--
View this message in context: http://www.nabble.com/Lucene-for-name-
matching-tf3533454.html#a9862342
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-------------------------------------------------------------------- -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--
View this message in context: http://www.nabble.com/Lucene-for-name- matching-tf3533454.html#a9863587
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






                
___________________________________________________________
Now you can scan emails quickly with a reading pane. Get the new Yahoo! Mail. http://uk.docs.yahoo.com/nowyoucan.html

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to