RE: Finding 'probable' duplicate records

John Brooking Fri, 07 Dec 2001 05:45:06 -0800

Carl,

  I don't have a lot of Perl-specific advice, but if it's possible to
dependably parse each line into the component fields (last name, first name,
street address, etc.), you could apply some intelligent guesses using the
various fields. If this is possible, here's what worked pretty well for me
with a mailing list database I once worked on:


   * Compare last names using the Soundex algorithm (which translates
similarly spelled names into the same code value; I believe there is a Perl
module or two for this).
   * Create a database of common first-name equivalents (e.g. Robert = Bob)
and compare first names by these values. (Per your example, you would
also want to treat an initial as a possible match for any first name
starting with that letter.)
   * Parse the street address into numbers, street names, and street labels
("Street", "St.", "Road", etc.), and consider any address with the same
street name and zip code or city as a possible match, regardless of the
numbers. Even better, compare street names using Soundex too.
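
   A minimal sketch of the three checks above, under several assumptions:
each record is already parsed into a hash with the hypothetical keys first,
last, street, city and zip; the nickname table and the lists of street labels
and compass directions are tiny illustrative samples, not a real database;
and soundex() is a hand-rolled, simplified version so the example is
self-contained (a CPAN module such as Text::Soundex could replace it).
Dropping compass directions is an addition of mine to cope with "NE" versus
"North East" in the example records.

```perl
use strict;
use warnings;

# Simplified American Soundex: first letter plus three digits.
sub soundex {
    my ($name) = @_;
    (my $s = uc $name) =~ s/[^A-Z]//g;
    return '' unless length $s;
    my $first = substr($s, 0, 1);
    # 1-6 are consonant classes, 0 marks vowels, 9 marks H and W.
    (my $codes = $s) =~ tr/BFPVCGJKQSXZDTLMNRAEIOUYHW/11112222222233455600000099/;
    my $head = substr($codes, 0, 1);
    (my $rest = substr($codes, 1)) =~ tr/9//d;   # H/W do not separate equal codes
    my $all = $head . $rest;
    $all =~ tr/0-9//s;                           # collapse runs of the same code
    $all = substr($all, 1);                      # the first letter is kept as a letter
    $all =~ tr/0//d;                             # vowels are not coded
    return substr($first . $all . '000', 0, 4);
}

# Illustrative nickname table (hypothetical sample data).
my %CANONICAL = (BOB => 'ROBERT', ROB => 'ROBERT', TOM => 'THOMAS', BILL => 'WILLIAM');

sub letters_only {
    my ($s) = @_;
    $s = uc $s;
    $s =~ s/[^A-Z]//g;
    return $s;
}

# An initial matches any name starting with that letter;
# otherwise compare the names after mapping nicknames to one form.
sub first_names_match {
    my ($x, $y) = (letters_only($_[0]), letters_only($_[1]));
    return 0 unless length($x) && length($y);
    if (length($x) == 1 || length($y) == 1) {
        return substr($x, 0, 1) eq substr($y, 0, 1) ? 1 : 0;
    }
    $x = $CANONICAL{$x} if exists $CANONICAL{$x};
    $y = $CANONICAL{$y} if exists $CANONICAL{$y};
    return $x eq $y ? 1 : 0;
}

# Illustrative word lists (hypothetical samples, far from complete).
my %LABEL     = map { $_ => 1 } qw(STREET ST ROAD RD AVENUE AVE DRIVE DR LANE LN);
my %DIRECTION = map { $_ => 1 } qw(N S E W NE NW SE SW NORTH SOUTH EAST WEST);

# Keep only the street name: drop numbers, labels and directions.
sub street_name {
    my ($addr) = @_;
    (my $s = uc $addr) =~ s/[^A-Z0-9 ]//g;
    return join ' ',
        grep { !/^\d+$/ && !$LABEL{$_} && !$DIRECTION{$_} } split ' ', $s;
}

sub probable_duplicate {
    my ($r1, $r2) = @_;
    return 0 unless soundex($r1->{last}) eq soundex($r2->{last});
    return 0 unless first_names_match($r1->{first}, $r2->{first});
    return 0 unless soundex(street_name($r1->{street}))
                 eq soundex(street_name($r2->{street}));
    my $same_zip  = defined $r1->{zip}  && defined $r2->{zip}
                 && $r1->{zip} eq $r2->{zip};
    my $same_city = defined $r1->{city} && defined $r2->{city}
                 && uc($r1->{city}) eq uc($r2->{city});
    return ($same_zip || $same_city) ? 1 : 0;
}

# The two example records from the original question, already split into fields.
my %rec1 = (first => 'Tom', last => 'Zorkoff',
            street => '123 Elm St. NE', city => 'Chicago');
my %rec2 = (first => 'T.', last => 'Zorkoff',
            street => '123 Elm Street North East', city => 'Chicago');
my $verdict = probable_duplicate(\%rec1, \%rec2) ? 'probable duplicate' : 'distinct';
print "$verdict\n";
```

   Every test here must pass before a pair is flagged, which keeps false
positives down; a scoring scheme that adds up partial matches would be a
reasonable alternative if you would rather review more candidates by hand.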

   Hope this helps. I'll leave the Perl code for this as an exercise for the
reader! :-)

- John

> -----Original Message-----
> From: Carl Rogers [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 06, 2001 8:33 PM
> To: [EMAIL PROTECTED]
> Subject: Finding 'probable' duplicate records
> 
> 
> Good day;
> Let me start off by saying that by just reading this list, I've received a
> lot of great information that I have been able to put to use....
> (Now that I've flattered all you gurus....)
> 
> I have code that finds true (exact) duplicates in records:
> 
> while (<INFILE>)
> {
>    if (not $seen{$_})
>    {
>       $seen{$_} = 1;
>       print OUTFILE;
>    }
>    else
>    {
>       # this record is a duplicate
>    }
> }
> 
> I've gone so far as to manipulate $_ so as to remove \W globally and make
> it uppercase (so that Mr Zorkoff and MR ZORKOFF get flagged as
> duplicates). This works well also.
> 
> What I'd like to do next, if it's possible, is to catch items like the
> following:
> Mr. Tom Zorkoff 123 Elm St. NE Chicago, Illinois
> Mr. T. Zorkoff 123 Elm Street North East, Chicago, IL
> 
> To do this, I'm hoping there is a way I can calculate the number of
> differences between line 1 and line 2 and use a percentage to determine if
> it should be considered unique or not.
> 
> I don't understand how (or if) $_ could tell that $seen{$_} is a potential
> candidate, then go so far as counting the differences between the two.
> 
> My thought is that I could get a scalar value for a substitution (i.e.:
> $result = () = $_ =~ /$seen{$_}/g; or something like that??), but I'm
> afraid that as soon as there is a single difference between the two,
> $result will be 0 (false).
> 
> Am I barking up a beanstalk??? I hope this makes sense. Any and all help is
> greatly appreciated.
> Thanks,
> Carl
> 
> 
> -- 
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 