RE: Finding 'probable' duplicate records

Carl Rogers Fri, 07 Dec 2001 06:43:12 -0800

Hey John;
Thanks for the help....

At 08:44 AM 12/7/2001 -0500, [EMAIL PROTECTED] wrote:
>Carl,
>
>   I don't have a lot of Perl-specific advice, but if it's possible to
>dependably parse each line into the component fields (last name, first name,
>street address, etc.), you could apply some intelligent guesses using the
>various fields. If this is possible, here's what worked pretty well for me
>with a mailing list database I once worked on:
>
>    * Compare last names using the Soundex algorithm (which translates
>similarly spelled names into the same code value; I believe there is a Perl
>module or two for this).


I've never heard of this (my ignorance is showing, I know), but I think 
this is more granular than I need. I was hoping that I could find a way to 
say 'Compare two strings (the fields within the strings aren't important). 
If string B has 17 common characters out of 20 in string A, you might want 
to consider that a match'.
I don't know if that's doable or not; I'm hoping someone out there may be 
able to tell me if it's been done before.
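
One way to sketch the "17 out of 20" idea is with an edit distance
(Levenshtein distance), which counts the single-character insertions,
deletions and substitutions needed to turn one string into the other. This
is a swapped-in technique, not something proposed in the thread, and the
names edit_distance and similarity are made up for illustration. The 0.85
cutoff is simply 17/20 from the paragraph above and would need tuning
against real data.

```perl
use strict;
use warnings;

# Levenshtein edit distance between $s and $t, computed with the usual
# dynamic-programming table, keeping only two rows at a time.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);    # distance from '' to each prefix of $t
    for my $i (1 .. @s) {
        my @cur = ($i);             # distance from a prefix of $s to ''
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1    < $best;  # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Similarity between 0 and 1: 1 means identical, 0 means nothing lines up.
sub similarity {
    my ($s, $t) = @_;
    my $longest = length($s) > length($t) ? length($s) : length($t);
    return 1 if $longest == 0;
    return 1 - edit_distance($s, $t) / $longest;
}

# Normalise the way the original post describes: strip \W, uppercase.
my $rec1 = 'Mr. Tom Zorkoff 123 Elm St. NE Chicago, Illinois';
my $rec2 = 'Mr. T. Zorkoff 123 Elm Street North East, Chicago, IL';
(my $norm1 = uc $rec1) =~ s/\W//g;
(my $norm2 = uc $rec2) =~ s/\W//g;

my $score = similarity($norm1, $norm2);
printf "similarity %.2f: %s\n", $score,
    $score >= 0.85 ? 'probable duplicate' : 'treat as unique';
```

Because abbreviations such as "St." versus "Street" cost several edits each,
the score for whole records may come out lower than intuition suggests;
comparing field by field, as John suggested, tends to be more forgiving.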

I guess what I'm trying to do is figure out how to get at (and manipulate) 
the neighbouring entries: $seen{$_} -1, $seen{$_}, and/or $seen{$_}+1, or at 
least find out how to get their values.
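
A Perl hash has no notion of a "previous" or "next" key, so %seen cannot
hand back neighbouring records. A minimal sketch of an alternative,
assuming the records fit in memory: keep the accepted lines in an array and
compare each new line against them. The helpers common_chars and
probable_duplicate, the sample records and the 0.85 threshold are all
hypothetical, chosen only to illustrate the approach.

```perl
use strict;
use warnings;

# Count the characters two strings share, treating each string as a bag of
# characters: order is ignored, repeated characters are counted.
sub common_chars {
    my ($s, $t) = @_;
    my %left;
    $left{$_}++ for split //, $s;
    my $shared = 0;
    for my $ch (split //, $t) {
        if ($left{$ch}) {
            $left{$ch}--;
            $shared++;
        }
    }
    return $shared;
}

# Return the first previously kept record that shares at least $threshold
# of the longer string's characters with $line, or undef if none does.
sub probable_duplicate {
    my ($line, $kept, $threshold) = @_;
    for my $old (@$kept) {
        my $longest = length($line) > length($old) ? length($line) : length($old);
        next if $longest == 0;
        return $old if common_chars($line, $old) / $longest >= $threshold;
    }
    return;
}

# Records already normalised (non-word characters removed, uppercased).
my @kept;
for my $line ('MRTOMZORKOFF123ELMST', 'MRTZORKOFF123ELMST', 'MSANNLEE9OAKAVE') {
    my $match = probable_duplicate($line, \@kept, 0.85);
    if (defined $match) {
        print "probable duplicate of $match: $line\n";
    }
    else {
        push @kept, $line;
    }
}
```

Every new line is compared with every kept line, so the work grows with the
square of the file size. Counting shared characters also ignores order, so
anagrams score as identical; it is a rough first filter rather than a
verdict.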

In the meantime, I'll see what I can find on Soundex.
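
For reference, a minimal hand-rolled sketch of the Soundex idea John
mentions: keep the first letter, turn the remaining consonants into digits,
collapse runs of the same digit, and pad to four characters. The function
name soundex_code is made up for this example; the Text::Soundex module is
the usual way to get this in practice, and its results should be preferred
wherever the two disagree.

```perl
use strict;
use warnings;

# Letter-to-digit table for American Soundex. Vowels, H, W and Y carry no
# digit and are simply absent from the table.
my %digit;
@digit{qw(B F P V)}         = (1) x 4;
@digit{qw(C G J K Q S X Z)} = (2) x 8;
@digit{qw(D T)}             = (3) x 2;
$digit{L}                   = 4;
@digit{qw(M N)}             = (5) x 2;
$digit{R}                   = 6;

sub soundex_code {
    my ($name) = @_;
    (my $letters = uc $name) =~ s/[^A-Z]//g;   # keep A-Z only
    return '' if $letters eq '';

    my @chars = split //, $letters;
    my $first = shift @chars;
    my $code  = $first;
    my $prev  = $digit{$first} || 0;   # the first letter's digit counts too

    for my $ch (@chars) {
        next if $ch eq 'H' || $ch eq 'W';   # H and W do not break a run
        my $d = $digit{$ch} || 0;
        $code .= $d if $d && $d != $prev;   # skip repeats of the same digit
        $prev = $d;                         # a vowel resets the run
    }
    return substr($code . '000', 0, 4);     # pad or truncate to 4 characters
}

# Names that sound alike should map to the same code.
for my $pair (['Zorkoff', 'Zorkov'], ['Robert', 'Rupert']) {
    my ($x, $y) = @$pair;
    printf "%s / %s: %s\n", $x, $y,
        soundex_code($x) eq soundex_code($y) ? 'same code' : 'different codes';
}
```

Soundex only looks at how a single name sounds, so it suits the last-name
field rather than a whole record line.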

Thanks again for the assistance.
Carl



> > -----Original Message-----
> > From: Carl Rogers [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, December 06, 2001 8:33 PM
> > To: [EMAIL PROTECTED]
> > Subject: Finding 'probable' duplicate records
> >
> >
> > Good day;
> > Let me start off by saying that by just reading this list,
> > I've received a
> > lot of great information that I have been able to put to use....
> > (Now that I've flattered all you gurus....)
> >
> > I have code that finds true (exact) duplicates in records:
> >
> > while (<INFILE>)
> > {
> >    if (not $seen{$_})
> >    {
> >     $seen{$_} = 1;
> >     print OUTFILE;
> >     }
> >    else
> >    {
> >       # this record is a duplicate
> >     }
> > }
> >
> > I've gone so far as to manipulate $_ so as to remove \W
> > globally and make
> > it uppercase (so that Mr Zorkoff and MR ZORKOFF get flagged as
> > duplicates). This works well also.
> >
> > What I'd like to do next, if it's possible, is to catch
> > items like the
> > following:
> > Mr. Tom Zorkoff 123 Elm St. NE Chicago, Illinois
> > Mr. T. Zorkoff 123 Elm Street North East, Chicago, IL
> >
> > To do this, I'm hoping there is a way I can calculate the number of
> > differences between line 1 and line 2 and use a percentage to
> > determine if
> > it should be considered unique or not.
> >
> > I don't understand how (or if) $_ could tell that $seen{$_}
> > is a potential
> > candidate, then go so far as counting the differences between the two.
> >
> > My thought is that I could get a scalar value for a
> > substitution (i.e.:
> > $result = () = $_ =~ /$seen{$_}/g; or something like that??), but I'm
> > afraid that as soon as there is a single difference between the two,
> > $result will be 0 (false).
> >
> > Am I barking up a beanstalk??? I hope this makes sense. Any
> > and all help is
> > greatly appreciated.
> > Thanks,
> > Carl
> >
> >
> > --
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
