Actually, if I understand what Joel was asking about, removing duplicates by address is a non-trivial task -- address data is notoriously dirty. What makes the job interesting is the wide variety of abbreviations used in addresses. For example, these four lines all refer to the same place:

22 Saint John Street
22 St John St
22 Saint John St
22 St John Street

So, if you have addresses from multiple sources, and you want to find duplicates between (or within) those sources, you have to find a way to standardize the addresses on one form. (I believe the USPS standardizes on the abbreviated form rather than the expanded form.)
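To make the idea concrete, here is a minimal sketch of standardizing on the abbreviated form. The helper name abbreviate_address and the tiny suffix table are made up for illustration, and treating every non-trailing "St" as "Saint" is a heuristic assumption, not a USPS rule.

```perl
use strict;
use warnings;

# Illustrative subset of street suffix abbreviations; a real table
# would be much longer.
my %SUFFIX = (
    street  => 'St',
    road    => 'Rd',
    avenue  => 'Ave',
    lane    => 'Ln',
    drive   => 'Dr',
    highway => 'Hwy',
);

# Hypothetical helper: abbreviate a trailing street suffix and assume
# that any earlier "St" means "Saint".
sub abbreviate_address {
    my $address = shift;
    my @words = split ' ', $address;
    return $address unless @words;

    # Strip trailing periods so "St." and "St" compare equal.
    s/\.$// for @words;

    # Only the last word is treated as the street suffix.
    my $last = lc $words[-1];
    $words[-1] = $SUFFIX{$last} if exists $SUFFIX{$last};

    # Heuristic: a "St" that is not the last word is taken to be "Saint".
    for my $i ( 0 .. $#words - 1 ) {
        $words[$i] = 'Saint' if lc $words[$i] eq 'st';
    }
    return join ' ', @words;
}

print abbreviate_address($_), "\n"
    for '22 Saint John Street', '22 St John St',
        '22 Saint John St',     '22 St John Street';
```

All four example addresses come out as "22 Saint John St". Real data will break the "last word is the suffix" assumption as soon as a unit number or a city trails the street name.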

You also have to parse the first and second address lines apart -- sometimes these are included on the same line in one of the "duplicates", while they might be separate entries ("address1" and "address2") in another potential duplicate. I have often seen datasets where city, state and zip are mistakenly included in the address1 or address2 line.
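One way to catch a stray city, state and zip is to look for them at the end of the line. The sketch below assumes US-style data in which commas separate the street, the city and the state code. The helper name split_city_state_zip is hypothetical, and lines without those commas are simply passed through.

```perl
use strict;
use warnings;

# Hypothetical helper: split a trailing "City, ST 12345" (or ZIP+4) off
# an address line. Returns a hash ref; only "street" is set when no
# trailing city/state/zip is recognized.
sub split_city_state_zip {
    my $line = shift;
    if ( $line =~ /^(.+?),\s*([^,]+),\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)\s*$/ ) {
        return { street => $1, city => $2, state => uc $3, zip => $4 };
    }
    return { street => $line };
}

my $parts = split_city_state_zip('22 Saint John St, Boston, MA 02116');
print join( ' | ', map { "$_=$parts->{$_}" } sort keys %$parts ), "\n";
```

The same approach could be extended to peel unit designators such as "Apt 4" off into address2, but the patterns get messy quickly, so treat this as a starting point rather than a parser.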

All of these issues make it difficult to take addresses from separate datasets and compare them to find duplicates.

We have used the following method for standardizing addresses. It expands abbreviations to their full form and catches the majority of the standardization issues (though it does not check for address2 in address1, or for the presence of city, state and zip).

sub standardize_address {
    my $address = shift;

    # Expand prefixes and directions. The greedy (.*) means only the last
    # occurrence of each abbreviation in the string is rewritten.
    if ( $address =~ /(.*)\sMt\.?\s(.*)/i ) { $address = $1 . ' Mount ' . $2 }
    if ( $address =~ /(.*)\sNt?h?\.?\s(.*)/i ) { $address = $1 . ' North ' . $2 }
    # Accept S, Sh and Sth, but not a bare "St": the original St?h? turned
    # "22 St John St" into "22 South John St".
    if ( $address =~ /(.*)\sS(?:t?h)?\.?\s(.*)/i ) { $address = $1 . ' South ' . $2 }
    if ( $address =~ /(.*)\sE\.?\s(.*)/i ) { $address = $1 . ' East ' . $2 }
    if ( $address =~ /(.*)\sW\.?\s(.*)/i ) { $address = $1 . ' West ' . $2 }
    if ( $address =~ /(.*)\sU\.?\s(.*)/i ) { $address = $1 . ' Upper ' . $2 }
    if ( $address =~ /(.*)\sL\.?\s(.*)/i ) { $address = $1 . ' Lower ' . $2 }
    if ( $address =~ /(.*)p\.?\s?o\.? box\s(.*)/i ) { $address = $1 . 'P.O. Box ' . $2 }

    # Expand street suffixes (again, last occurrence only).
    if ( $address =~ /(.*)\sSt\b\.?(\s*.*)/i ) { $address = $1 . ' Street' . $2 }
    if ( $address =~ /(.*)\sRd\b\.?(\s*.*)/i ) { $address = $1 . ' Road' . $2 }
    if ( $address =~ /(.*)\sLa\b\.?(\s*.*)/i ) { $address = $1 . ' Lane' . $2 }
    if ( $address =~ /(.*)\sAve\b\.?(\s*.*)/i ) { $address = $1 . ' Avenue' . $2 }
    if ( $address =~ /(.*)\sHwy\b\.?(\s*.*)/i ) { $address = $1 . ' Highway' . $2 }

    # "Dr." followed by a space becomes "Drive." on the first pass, so the
    # second substitution removes the leftover period.
    $address =~ s/\bDr\.?\b/Drive/ig;
    $address =~ s/\bDrive\./Drive/g;

    # Drop pound signs, e.g. in "#4".
    $address =~ s/#//g;

    return $address;
}
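Once addresses are in one form, finding candidate duplicates is mostly a matter of using that form as a hash key. This is a rough sketch, not what we run in production: address_key is a deliberately crude stand-in normalizer (an assumption for this example) and find_duplicates is a hypothetical name. In practice a fuller standardizer would go where address_key is called.

```perl
use strict;
use warnings;

# Stand-in normalizer (an assumption for this sketch): lowercase, drop
# punctuation, collapse whitespace.
sub address_key {
    my $address = lc(shift);
    $address =~ s/[^a-z0-9\s]//g;
    $address =~ s/\s+/ /g;
    $address =~ s/^ | $//g;
    return $address;
}

# Group addresses by normalized key. Any group with more than one
# member is a set of candidate duplicates.
sub find_duplicates {
    my @addresses = @_;
    my %seen;
    push @{ $seen{ address_key($_) } }, $_ for @addresses;
    return grep { @$_ > 1 } values %seen;
}

my @groups = find_duplicates(
    '22 Saint John St.',
    '22  saint john st',
    '5 Elm Road',
);
print join( ' <=> ', @$_ ), "\n" for @groups;
```

Exact-key matching only finds addresses that normalize to identical strings; near misses such as typos would need a fuzzy comparison on top of this.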

HTH.

-Chris

John Saylor wrote:

> hi
>
> ( 03.08.04 17:12 -0400 ) Joel Gwynn:
>
>> we're looking for a fast, customizable de-duping solution.
>> I was thinking there might be some perl stuff out there,
>
> really, any perl programmer worth hiring should be able to do this while sleeping.




-- Chris Brooks VP, Technology carescout.com
