22 Saint John Street
22 St John St
22 Saint John St
22 St John Street
So, if you have addresses from multiple sources and you want to find duplicates between (or within) those sources, you have to find a way to standardize the addresses (I believe the USPS, for example, standardizes on the abbreviated form rather than the expanded form).
You also have to parse the first and second address lines apart -- sometimes both are included on the same line in one of the "duplicates", while they are separate entries ("address1" and "address2") in another potential duplicate. I have often seen datasets where city, state and zip are mistakenly included in the address1 or address2 line.
All of these issues make it difficult to take addresses from separate datasets and compare them to find duplicates.
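As a rough illustration of the address1/address2 problem, here is a minimal sketch that splits a trailing unit designator off a single-line address. The sub name split_address_lines and the designator list (Apt, Apartment, Suite, Ste, Unit, #) are assumptions made for illustration; the list is far from complete, and the sketch does nothing about city, state and zip.

```perl
use strict;
use warnings;

# Hypothetical helper: split "123 Main St Apt 4" into ("123 Main St", "Apt 4").
# Only a single trailing token after the designator is recognized.
sub split_address_lines {
    my $line = shift;
    if ( $line =~ /^(.*?)[\s,]+((?:Apt|Apartment|Suite|Ste|Unit)\b\.?\s*\S+|#\s*\S+)\s*$/i ) {
        return ( $1, $2 );
    }
    # No recognizable unit designator: leave address2 empty.
    return ( $line, '' );
}

my ( $address1, $address2 ) = split_address_lines('22 Saint John Street, Apt 4');
print "address1: $address1\naddress2: $address2\n";
```

Running both datasets through something like this before comparing at least puts the unit in the same field on both sides.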
We have used the following method for standardizing addresses -- it catches the majority of the standardization issues (though it does not check for address2 in address1, or the presence of city, state and zip).
sub standardize_address {
    my $address = shift;

    # Expand prefixes. Each rule rewrites only the last match in the
    # string, because the leading (.*) is greedy.
    if ( $address =~ /(.*)\sMt\.?\s(.*)/i ) { $address = $1 . ' Mount ' . $2 }
    # "N"/"Nth" and "S"/"Sth" only; a bare "St" is left for the Street rule
    if ( $address =~ /(.*)\sN(?:th)?\.?\s(.*)/i ) { $address = $1 . ' North ' . $2 }
    if ( $address =~ /(.*)\sS(?:th)?\.?\s(.*)/i ) { $address = $1 . ' South ' . $2 }
    if ( $address =~ /(.*)\sE\.?\s(.*)/i ) { $address = $1 . ' East ' . $2 }
    if ( $address =~ /(.*)\sW\.?\s(.*)/i ) { $address = $1 . ' West ' . $2 }
    if ( $address =~ /(.*)\sU\.?\s(.*)/i ) { $address = $1 . ' Upper ' . $2 }
    if ( $address =~ /(.*)\sL\.?\s(.*)/i ) { $address = $1 . ' Lower ' . $2 }
    if ( $address =~ /(.*)p\.?\s?o\.? box\s(.*)/i ) { $address = $1 . 'P.O. Box ' . $2 }

    # Expand street-type suffixes.
    if ( $address =~ /(.*)\sSt\b\.?(\s*.*)/i ) { $address = $1 . ' Street' . $2 }
    if ( $address =~ /(.*)\sRd\b\.?(\s*.*)/i ) { $address = $1 . ' Road' . $2 }
    if ( $address =~ /(.*)\sLa\b\.?(\s*.*)/i ) { $address = $1 . ' Lane' . $2 }
    if ( $address =~ /(.*)\sAve\b\.?(\s*.*)/i ) { $address = $1 . ' Avenue' . $2 }
    if ( $address =~ /(.*)\sHwy\b\.?(\s*.*)/i ) { $address = $1 . ' Highway' . $2 }

    # "Dr" and "Dr." both become "Drive"; the second pass drops a leftover period.
    $address =~ s/\bDr\.?\b/Drive/ig;
    $address =~ s/\bDrive\./Drive/g;

    # Drop "#" (as in "#4").
    $address =~ s/#//g;

    return $address;
}
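To show where a normalizer like this fits into de-duping, here is a minimal sketch that groups addresses by their normalized form. The normalize sub is a deliberately tiny stand-in (an assumption for illustration: it only trims, collapses whitespace, expands a trailing "St" and lowercases); in practice you would call standardize_address there instead. find_duplicates is a hypothetical name as well.

```perl
use strict;
use warnings;

# Stand-in normalizer (assumption): swap in standardize_address here.
sub normalize {
    my $address = shift;
    $address =~ s/^\s+|\s+$//g;         # trim leading/trailing whitespace
    $address =~ s/\s+/ /g;              # collapse runs of whitespace
    $address =~ s/\bSt\.?$/Street/i;    # expand a trailing "St" or "St." only
    return lc $address;                 # compare case-insensitively
}

# Hypothetical helper: group addresses by normalized form and return
# only the groups that have more than one member.
sub find_duplicates {
    my @addresses = @_;
    my %groups;
    push @{ $groups{ normalize($_) } }, $_ for @addresses;
    return grep { @$_ > 1 } values %groups;
}

my @dupes = find_duplicates( '22 Saint John St', '22 saint john  street', '5 Elm Road' );
print join( ' | ', @$_ ), "\n" for @dupes;
```

Keying a hash on the normalized string keeps this to a single pass over the data, which matters if "fast" is a requirement; fuzzy matching on top of that is a separate problem.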
HTH.
-Chris
John Saylor wrote:
> hi
>
> ( 03.08.04 17:12 -0400 ) Joel Gwynn:
>> we're looking for a fast, customizable de-duping solution.
>> I was thinking there might be some perl stuff out there,
>
> really, any perl programmer worth hiring should be able to do this while sleeping.
-- Chris Brooks VP, Technology carescout.com
_______________________________________________ Boston-pm mailing list [EMAIL PROTECTED] http://mail.pm.org/mailman/listinfo/boston-pm

