Hey John;
Thanks for the help....

At 08:44 AM 12/7/2001 -0500, [EMAIL PROTECTED] wrote:
>Carl,
>
>     I don't have a lot of Perl-specific advice, but if it's possible to
>dependably parse each line into the component fields (last name, first name,
>street address, etc.), you could apply some intelligent guesses using the
>various fields.  If this is possible, here's what worked pretty well for me
>with a mailing list database I once worked on:
>
>     * Compare last names using the Soundex algorithm (which translates
>similar-sounding names into the same code value; I believe there is a Perl
>module or two for this).
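For readers who have not met Soundex: the sketch below is a minimal, hand-rolled version of the simplified American Soundex rules (first letter plus three digits). The sub name soundex_code is made up for illustration; CPAN's Text::Soundex is probably what you'd use in practice, and its edge-case handling may differ from this sketch.

```perl
use strict;
use warnings;

# Simplified American Soundex: keep the first letter, then encode the
# following consonants as digits, and pad or truncate to four characters.
# soundex_code is a hypothetical name, not part of any module.
sub soundex_code {
    my ($name) = @_;
    my $s = uc $name;
    $s =~ s/[^A-Z]//g;                  # keep ASCII letters only
    return '' if $s eq '';

    my %code;
    @code{qw(B F P V)}         = (1) x 4;
    @code{qw(C G J K Q S X Z)} = (2) x 8;
    @code{qw(D T)}             = (3) x 2;
    $code{L}                   = 4;
    @code{qw(M N)}             = (5) x 2;
    $code{R}                   = 6;

    my @chars = split //, $s;
    my $first = shift @chars;
    my $prev  = $code{$first} // 0;     # code of the previous significant letter
    my $out   = $first;

    for my $c (@chars) {
        next if $c eq 'H' || $c eq 'W'; # H and W do not separate equal codes
        my $d = $code{$c} // 0;         # vowels (and Y) have no code
        $out .= $d if $d && $d != $prev;
        $prev = $d;                     # a vowel resets the "same code" run
    }
    return substr($out . '000', 0, 4);
}

# Similar-sounding spellings collapse to one code.
print soundex_code($_), "  $_\n" for qw(Zorkoff Zorkov Zorkof);
```

Two last names would then count as a probable match when their codes are equal, which is the comparison John describes.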
I've never heard of this (my ignorance is showing, I know), but I think this
is more granular than I need. I was hoping that I could find a way to say
'Compare two strings (the fields within the strings aren't important). If
string B has 17 common characters out of 20 in string A, you might want to
consider that a match'. Don't know if that's do-able or not.. hoping someone
out there may be able to tell me if it's been done before.

I guess what I'm trying to do is figure out how to play with (manipulate)
$seen{$_} -1, $seen{$_}, and/or $seen{$_}+1, or at least find out how to get
their values.

In the meantime, I'll see what I can find on Soundex.
Thanks again for the assistance.
Carl

> > -----Original Message-----
> > From: Carl Rogers [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, December 06, 2001 8:33 PM
> > To: [EMAIL PROTECTED]
> > Subject: Finding 'probable' duplicate records
> >
> >
> > Good day;
> > Let me start off by saying that by just reading this list, I've received a
> > lot of great information that I have been able to put to use....
> > (Now that I've flattered all you gurus....)
> >
> > I have code that finds true (exact) duplicates in records:
> >
> > while (<INFILE>)
> > {
> >     if (not $seen{$_})
> >     {
> >         $seen{$_} = 1;
> >         print OUTFILE;
> >     }
> >     else
> >     {
> >         # this record is a duplicate
> >     }
> > }
> >
> > I've gone so far as to manipulate $_ so as to remove \W globally and make
> > it uppercase (so that Mr Zorkoff and MR ZORKOFF get flagged as
> > duplicates). This works well also.
> >
> > What I'd like to do next, if it's possible, is to catch items like the
> > following:
> > Mr. Tom Zorkoff 123 Elm St. NE Chicago, Illinois
> > Mr. T. Zorkoff 123 Elm Street North East, Chicago, IL
> >
> > To do this, I'm hoping there is a way I can calculate the number of
> > differences between line 1 and line 2 and use a percentage to determine if
> > it should be considered unique or not.
> >
> > I don't understand how (or if) $_ could tell that $seen{$_} is a potential
> > candidate, then go so far as counting the differences between the two.
> >
> > My thought is that I could get a scalar value for a substitution (i.e.:
> > $result = () = $_ =~ /$seen{$_}/g; or something like that??), but I'm
> > afraid that as soon as there is a single difference between the two,
> > $result will be 0 (false).
> >
> > Am I barking up a beanstalk??? I hope this makes sense. Any and all help is
> > greatly appreciated.
> > Thanks,
> > Carl
> >
> >
> > --
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
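One way the "count the differences and use a percentage" idea might be sketched is with Levenshtein edit distance. Note that %seen is a hash, so it has no previous or next entry to peek at; this sketch instead compares each normalized record against every key kept so far. The sub names (edit_distance, similarity, normalize, split_records) and the threshold value are assumptions made for illustration, not anything from the thread, and the all-pairs comparison will be slow on large files.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Levenshtein distance: number of single-character inserts, deletes and
# substitutions needed to turn $s into $t (two-row dynamic programming).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            $cur[$j] = min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Similarity as a fraction of the longer string: 1 means identical.
sub similarity {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 1 if $longest == 0;
    return 1 - edit_distance($s, $t) / $longest;
}

# Same normalization the original post describes: uppercase, drop \W.
sub normalize {
    my ($line) = @_;
    (my $key = uc $line) =~ s/\W//g;
    return $key;
}

# Split lines into (unique, probable duplicates) using a similarity threshold.
sub split_records {
    my ($threshold, @lines) = @_;
    my (@kept_keys, @unique, @probable);
    LINE: for my $line (@lines) {
        my $key = normalize($line);
        for my $kept (@kept_keys) {
            if (similarity($key, $kept) >= $threshold) {
                push @probable, $line;
                next LINE;
            }
        }
        push @kept_keys, $key;
        push @unique, $line;
    }
    return (\@unique, \@probable);
}

# 17 matching characters out of 20 corresponds to a threshold of 0.85.
my ($unique, $probable) = split_records(0.85, 'Mr Zorkoff', 'MR ZORKOFF', 'Ms Smith');
print "unique:   $_\n" for @$unique;
print "probable: $_\n" for @$probable;
```

Abbreviations such as "St." versus "Street" or "IL" versus "Illinois" add a lot of edits, so the two Zorkoff lines in the original post may well score below a strict threshold; expanding common abbreviations before comparing, or comparing parsed fields as John suggests, would likely help.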