Dave,

I've been doing something like what you are talking about.  I was interested to see what responses you got and to see if anyone was using NSP to do/improve on this.

I started out with latent semantic analysis (LSA) on larger strings than names... maybe you're familiar... basically it's a statistical technique for computing numeric similarity measures of longish strings (like documents) based mainly on the co-occurrence of terms.  While this works great on longer strings, it's not even remotely effective on names and addresses.

That's where n-grams come in.  I compute trigrams of the letters of names and addresses and perform LSA on strings of trigrams (Dave Bothwell "_DA DAV AVE VE_ E_B _BO ..." has many co-occurrences with David Bothwell "_DA DAV AVI VID ID_ D_B _BO ...").  This works reasonably well (though "Ellen" and "Allen" kinds of false positives are too common), but LSA is pretty cumbersome and works best in a context where you have a collection of strings and you are querying new strings to see if they are similar to any in your collection.  I had to roll my own LSA code, which was unpleasant.  There is a Perl package for "contextual network analysis" that I would try first...
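
For anyone who wants to see the trigram step concretely, here is a minimal sketch in Python. It is for illustration only and is not the code I actually use. The padding and upper-casing follow the example above; the function names (trigrams, dice) are made up for this sketch, and the Dice coefficient on trigram sets is a much simpler stand-in for the LSA step, not LSA itself.

```python
def trigrams(text, pad="_"):
    """Return the letter trigrams of text, padded at word boundaries."""
    # Upper-case, replace whitespace runs with the pad character, and pad
    # both ends, so "Dave Bothwell" becomes "_DAVE_BOTHWELL_".
    normalized = pad + pad.join(text.upper().split()) + pad
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


def dice(a, b):
    """Dice coefficient over trigram sets (an assumed stand-in, not LSA)."""
    ta, tb = set(trigrams(a)), set(trigrams(b))
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


if __name__ == "__main__":
    print(" ".join(trigrams("Dave Bothwell")))
    print(dice("Dave Bothwell", "David Bothwell"))
    print(dice("Ellen", "Allen"))
```

Even this crude overlap score shows the "Ellen"/"Allen" problem: the two names share three of their five trigrams each, so they score fairly high despite being different names.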

When you talk about ratios, it sounds like you are computing membership likelihoods... I don't know how you would compute those likelihoods, but if you can do so, that approach should be the most powerful... I think that would work best if you were sure that each name on your first list appears in the second list... which I think is not the situation?

Hope this helps.  I'd be interested in hearing more about your approach... either here or off-line.

-Alan

----- Original Message ----
From: dave1234870 <[EMAIL PROTECTED]>
To: ngram@yahoogroups.com
Sent: Friday, December 16, 2005 12:48:16
Subject: [ngram] "Can you do this with NSP/Ngram" type question: Name Matching?

Greetings all, my first post to the group, hoping that it's an
appropriate forum for this question ... if not, my apologies to the
group.

I'm a moderately proficient, self-taught Perl hacker working in the
fraud examination type industry.  I work with large amounts of data to
identify scenarios wherein Names and/or Addresses serve as nexus
points for discrete network analysis.  Of course, my problem is that
names and addresses are quite often misspelled or not consistent.
Examples,

John Edwards
Jon Edwards

123 Main Street
123 Main St

PO Box 123
Post Office Box 123
etc.

I've read over the docs for the NSP package, but am having a hard time
wrapping my brain around it.  Would it be possible for the NSP package
(count.pl and statistic.pl) to accomplish a test upon a pair of names
to achieve a match probability ratio?

In a perfect world, I want to open a large file with 1 long list of
names.  Starting at the first name, I want to iterate over the entire
list and achieve ratio probabilities for each pair of names.  As each
ratio is computed, I'll test it for a threshold and if the pair
exceeds a threshold, I'll push it to an array.  Repeat for the 2nd
name in the list, 3rd name in the list, etc.
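
The loop described above might be sketched roughly as follows. This is a Python illustration only, not NSP and not Perl. The scoring function is an assumption: it uses the standard library's difflib.SequenceMatcher ratio as a placeholder for whatever match ratio ends up being used, and the names similarity, candidate_pairs and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Placeholder score in [0, 1]; swap in any other pairwise measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(names, threshold=0.85):
    """Score every unordered pair once and keep those above the threshold."""
    matches = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score > threshold:
            matches.append((a, b, score))
    return matches


if __name__ == "__main__":
    names = ["John Edwards", "Jon Edwards", "PO Box 123"]
    for a, b, score in candidate_pairs(names):
        print(f"{a} <-> {b}: {score:.3f}")
```

Note that comparing every pair means n*(n-1)/2 comparisons for n names, so this brute-force form gets slow quickly on a large file.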

Thanks in advance for any wisdom you might have on this question :-)



