Finding 'probable' duplicate records

Carl Rogers Thu, 06 Dec 2001 17:39:37 -0800

Good day;
Let me start off by saying that by just reading this list, I've received a 
lot of great information that I have been able to put to use....
(Now that I've flattered all you gurus....)


I have code that finds true (exact) duplicates in records:

while (<INFILE>)
{
    if (not $seen{$_})
    {
        $seen{$_} = 1;
        print OUTFILE;
    }
    else
    {
        # this record is a duplicate
    }
}

I've gone so far as to manipulate $_ so as to remove \W globally and make 
it uppercase (so that Mr Zorkoff and MR ZORKOFF get flagged as 
duplicates). This works well, too.
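
For reference, the normalization described above might look roughly like the sketch below. The helper name normalize_key and the sample records are hypothetical and only serve the illustration; the original record is kept for output and only the hash key is normalized.

```perl
use strict;
use warnings;

# Hypothetical helper: build a comparison key from a raw record.
sub normalize_key {
    my ($record) = @_;
    my $key = uc $record;   # fold case
    $key =~ s/\W//g;        # drop every character that is not a letter, digit or underscore
    return $key;
}

my %seen;
my @unique;
for my $record ("Mr Zorkoff", "MR ZORKOFF", "Ms. Smith") {   # sample data, assumed
    my $key = normalize_key($record);
    next if $seen{$key}++;      # key already seen: treat the record as a duplicate
    push @unique, $record;      # first occurrence: keep the record as written
}
print "$_\n" for @unique;
```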

What I'd like to do next, if it's possible, is to catch items like the 
following:
Mr. Tom Zorkoff 123 Elm St. NE Chicago, Illinois
Mr. T. Zorkoff 123 Elm Street North East, Chicago, IL

To do this, I'm hoping there is a way I can calculate the number of 
differences between line 1 and line 2 and use a percentage to determine if 
it should be considered unique or not.
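
One common way to count the differences between two strings is the Levenshtein edit distance (the minimum number of single-character insertions, deletions and substitutions). The sketch below is only one possible measure, not necessarily the best fit for address data. The names edit_distance and similarity are hypothetical, and dividing by the longer string's length is just one assumed way to turn the distance into a percentage.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Levenshtein distance, computed row by row so only two rows are held in memory.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar(@t));          # distance from "" to each prefix of $t
    for my $i (1 .. scalar(@s)) {
        my @cur = ($i);                    # distance from a prefix of $s to ""
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,             # deletion
                $cur[$j - 1] + 1,          # insertion
                $prev[$j - 1] + $cost,     # substitution (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Similarity as a percentage of the longer string's length (assumed scaling).
sub similarity {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 100 if $longest == 0;           # two empty strings are identical
    return 100 * (1 - edit_distance($s, $t) / $longest);
}

printf "%.1f%% similar\n", similarity(
    "Mr. Tom Zorkoff 123 Elm St. NE Chicago, Illinois",
    "Mr. T. Zorkoff 123 Elm Street North East, Chicago, IL",
);
```

The cut-off percentage below which two records count as distinct would have to be tuned against real data.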

I don't understand how (or if) $_ could tell that $seen{$_} is a potential 
candidate, then go so far as counting the differences between the two.
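
A hash lookup only answers "is this exact key present?", so on its own it cannot surface a near match. A minimal sketch of one workaround, assuming the file is small enough to compare each new record against every record kept so far, follows. The names word_overlap and find_probable_duplicate, the sample records and the threshold of 30 are all hypothetical, and the word-overlap score is only a stand-in for whatever similarity measure is chosen.

```perl
use strict;
use warnings;

# Stand-in similarity (assumption): percentage of distinct words the two
# records share, after folding case and splitting on non-word characters.
sub word_overlap {
    my ($s, $t) = @_;
    my (%in_s, %in_t);
    $in_s{$_} = 1 for grep { length } split /\W+/, uc $s;
    $in_t{$_} = 1 for grep { length } split /\W+/, uc $t;
    my $shared = grep { $in_t{$_} } keys %in_s;           # words present in both
    my $total  = keys(%in_s) + keys(%in_t) - $shared;     # words present in either
    return $total ? 100 * $shared / $total : 100;
}

# Return the first kept record scoring at or above the threshold, else undef.
sub find_probable_duplicate {
    my ($record, $kept, $threshold) = @_;
    for my $candidate (@$kept) {
        return $candidate if word_overlap($record, $candidate) >= $threshold;
    }
    return;
}

my $threshold = 30;    # hypothetical cut-off, to be tuned on real data
my @kept;
for my $record (
    "Mr. Tom Zorkoff 123 Elm St. NE Chicago, Illinois",
    "Mr. T. Zorkoff 123 Elm Street North East, Chicago, IL",
    "Ms. Ann Smith 9 Oak Ave Boston, MA",                 # sample data, assumed
) {
    my $match = find_probable_duplicate($record, \@kept, $threshold);
    if (defined $match) {
        print "Probable duplicate: $record\n";
    } else {
        push @kept, $record;
    }
}
```

Every new record is compared with every kept record, so the work grows roughly with the square of the number of records.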

My thought is that I could get a scalar count of matches (i.e.: 
$result = () = $_ =~ /$seen{$_}/g; or something like that??), but I'm 
afraid that as soon as there is a single difference between the two, 
$result will be 0 (false).

Am I barking up a beanstalk??? I hope this makes sense. Any and all help is 
greatly appreciated.
Thanks,
Carl

