Re: [ngram] "Can you do this with NSP/Ngram" type question: Name Matching?

Alan Mead Mon, 19 Dec 2005 11:03:38 -0800

Dave,

I've been doing something like what you are talking about. I was interested to see what responses you got and to see if anyone was using NSP to do/improve on this.

I started out with latent semantic analysis (LSA) on larger strings than names... maybe you're familiar... basically it's a statistical technique for computing numeric similarity measures of longish strings (like documents) based mainly on the co-occurrence of terms. While this works great on longer strings, it's not even remotely effective on names and addresses.

That's where n-grams come in. I compute trigrams of the letters of names and addresses and perform LSA on strings of trigrams (Dave Bothwell "_DA DAV AVE VE_ E_B _BO ..." has many co-occurrences with David Bothwell "_DA DAV AVI VID ID_ D_B _BO ..."). This works reasonably well (although "Ellen" and "Allen" kinds of false positives are too common), but LSA is pretty cumbersome and works best in a context where you have a collection of strings and you are querying new strings to see if they are similar to any in your collection. I had to roll my own LSA code, which was unpleasant. There is a Perl package for "contextual network analysis" that I would try first...
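To make the trigram step concrete, here is a minimal Python sketch. It is not the LSA pipeline described above: it only builds the padded letter trigrams and scores two strings with a plain Dice overlap of their trigram sets, which is a much simpler stand-in. The function names char_trigrams and dice_similarity are made up for this illustration.

```python
def char_trigrams(text, pad="_"):
    """Return letter trigrams, using pad for spaces and string edges."""
    # Upper-case, collapse runs of whitespace, and mark word boundaries.
    normalized = pad + pad.join(text.upper().split()) + pad
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


def dice_similarity(a, b):
    """Dice coefficient over the sets of trigrams of two strings."""
    grams_a, grams_b = set(char_trigrams(a)), set(char_trigrams(b))
    if not grams_a and not grams_b:
        return 0.0  # nothing to compare
    shared = len(grams_a & grams_b)
    return 2.0 * shared / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    print(" ".join(char_trigrams("Dave Bothwell")))
    print(dice_similarity("Dave Bothwell", "David Bothwell"))
    print(dice_similarity("Ellen", "Allen"))
```

Under this scoring, "Ellen" and "Allen" share three of their five trigrams each, so they still score fairly high, which is the same kind of false positive mentioned above.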

When you talk about ratios, it sounds like you are computing membership likelihoods... I don't know how you would compute those likelihoods but if you can do so, that approach should be the most powerful... I think that would work best if you were sure that each name on your first list appears in the second list... which I think is not the situation?

Hope this helps. I'd be interested in hearing more about your approach, either here or off-line.

-Alan

----- Original Message ----
From: dave1234870 <[EMAIL PROTECTED]>
To: ngram@yahoogroups.com
Sent: Friday, December 16, 2005 12:48:16
Subject: [ngram] "Can you do this with NSP/Ngram" type question: Name Matching?

Greetings all, my first post to the group, hoping that it's an appropriate forum for this question ... if not, my apologies to the group.

I'm a moderately proficient, self-taught Perl hacker working in the fraud examination type industry. I work with large amounts of data to identify scenarios wherein Names and/or Addresses serve as nexus points for discrete network analysis. Of course, my problem is that names and addresses are quite often misspelled or not consistent. Examples:

    John Edwards       Jon Edwards
    123 Main Street    123 Main St
    PO Box 123         Post Office Box 123
    etc.

I've read over the docs for the NSP package, but am having a hard time wrapping my brain around it. Would it be possible for the NSP package (count.pl and statistic.pl) to run a test on a pair of names and produce a match probability ratio?

In a perfect world, I want to open a large file with 1 long list of names. Starting at the first name, I want to iterate over the entire list and compute a match probability ratio for each pair of names. As each ratio is computed, I'll test it against a threshold and if the pair exceeds the threshold, I'll push it to an array. Repeat for the 2nd name in the list, 3rd name in the list, etc.

Thanks in advance for any wisdom you might have on this question :-)
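For reference, the all-pairs loop described above might look roughly like the following Python sketch. It does not use NSP at all: the score comes from the standard library's difflib.SequenceMatcher, used here purely as a stand-in for whatever match ratio is eventually chosen. The function name similar_pairs and the 0.85 threshold are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.85):
    """Return (name_a, name_b, ratio) for every pair scoring above threshold."""
    matches = []
    # combinations() visits each unordered pair once: 1st vs. the rest,
    # then 2nd vs. the rest, and so on.
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio > threshold:
            matches.append((a, b, ratio))
    return matches


if __name__ == "__main__":
    sample = ["John Edwards", "Jon Edwards", "123 Main Street", "PO Box 123"]
    for a, b, ratio in similar_pairs(sample):
        print(f"{a!r} ~ {b!r}: {ratio:.3f}")
```

Note that comparing every pair grows quadratically with the length of the list, so a very large file would probably need some way of narrowing the candidates before scoring.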
