Detecting (Almost) Matches for DeDuping?

Matthew Reinbold Sun, 14 Jan 2007 09:37:56 -0800

I have a dataset that I've put together from a number of client files. Up to 
this point I've been able to easily build a set of ColdFusion tools for using 
the data, but there is a de-duping process that I need to do that I just don't 
know how to approach.


The data has a series of first and last names. Most of the time I'm able to 
match on last name, first name, and date of birth and create a unique entry in 
the unified person table. The problem comes when the names are slightly 
misspelled.

For example, I may have:
RIVERA and RIVEERA -or-
MARTINEZ and MARTINE

and because I'm doing exact matching, these are appearing as two separate 
entries. I really don't want to eyeball the entire table (thousands of lines) 
and manually pick out problem rows. And I don't think I can completely automate 
the detection AND correction of dupes. At this point I just want to run 
ColdFusion code, have it detect potential dupes, and then let me take action.

How would I do this? Is there a regular expression that can detect whether two 
strings are ALMOST matches? Any help or suggestions would be most appreciated.
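
For illustration, one common way to quantify "almost matches" is edit 
distance (Levenshtein distance), which counts the single-character insertions, 
deletions, or substitutions needed to turn one string into another. The sketch 
below is in Python rather than ColdFusion and is only an assumption about how 
such a check might look; the function names, the sample list, and the 
threshold of 1 are hypothetical and not part of the original dataset or 
tooling.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching char
            ))
        previous = current
    return previous[-1]


def potential_dupes(names, max_distance=1):
    """Return pairs of distinct names within max_distance edits of each other."""
    unique = sorted(set(names))
    return [(a, b) for a, b in combinations(unique, 2)
            if edit_distance(a, b) <= max_distance]


# Hypothetical sample built from the names mentioned in this post.
last_names = ["RIVERA", "RIVEERA", "MARTINEZ", "MARTINE", "SMITH"]
for a, b in potential_dupes(last_names):
    print(a, b)
```

This only flags candidate pairs for manual review; it does not decide which 
spelling is correct. Comparing every pair grows quadratically with the number 
of names, so on a larger table it may help to compare only rows that already 
share a date of birth.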

Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:266546