standardising spellings

Dermot Paikkos Thu, 24 Feb 2005 10:03:17 -0800

Hi,

I have a list of about 650 names (a small sample is below) that I 
need to import into a database. When you look at the list there are 
some obvious duplicates that are spelt slightly differently. I can 
rationalize some of the data with some simple substitutions but some 
of the data looks almost impossible to parse programmatically.
Here is what I have done so far - it's not much:


#!/bin/perl
use strict;
use warnings;

my $file = "myfile.csv";
open(my $fh, '<', $file) or die "Can't open $file: $!\n";
while (<$fh>) {
        chomp;
        s/&/and/g;              # change every & to and
        s/"//g;                 # remove any quotes
        s{\s*/\s*}{/}g;         # remove any space on either side of a slash
        s/[\s,]+$//;            # remove any trailing white space and commas
        print "$_\n";
}
close($fh);

Are there any other techniques that I can use to help standardise the 
list? I know I am going to have to look at the list manually and sort 
it but I thought there might be some way to give myself a head start.

If I could I would like to generate a csv file so that the first 
field contains the first appearance of a name and if there are any 
near hits these appear in the second and third fields. EG:
"Alan and Sandy Carey, Alan & Sandy Carey\n"
"Alan Carey, Alan D Carey\n" 
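
To make the target format concrete, here is a rough sketch of the kind of
grouping I have in mind. It is only an assumption about how this could be
done, not something I have run against the full list. The helper names
name_key and group_names are made up for illustration. The key is deliberately
crude: lower case, "&" treated as "and", punctuation dropped, and single-letter
middle initials ignored. Suffixes such as "FBPA" or "M.D." are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce a name to a crude comparison key.
sub name_key {
    my ($name) = @_;
    my $key = lc $name;
    $key =~ s/&/ and /g;        # treat "&" and "and" alike
    $key =~ s/[".,]/ /g;        # drop quotes, full stops and commas
    my @words = split ' ', $key;
    return '' unless @words;
    # Keep the first word; drop later single-letter words (middle initials).
    my $first = shift @words;
    return join ' ', $first, grep { length($_) > 1 } @words;
}

# Hypothetical helper: group names sharing a key, in order of first appearance.
# Returns a list of array refs; element 0 of each is the first spelling seen.
sub group_names {
    my @names = @_;
    my (%group, %seen, @order);
    for my $name (@names) {
        next if $seen{$name}++;             # skip exact repeats
        my $key = name_key($name);
        next if $key eq '';
        push @order, $key unless exists $group{$key};
        push @{ $group{$key} }, $name;
    }
    return map { $group{$_} } @order;
}

my @sample = (
    'Alan and Sandy Carey', 'Alan & Sandy Carey',
    'Alan Carey',           'Alan D. Carey',
    'Bill Bachman',         'Bill Bachmann',
);
print join(', ', @$_), "\n" for group_names(@sample);
```

On that sample the two Carey couples end up on one line, Alan Carey and
Alan D. Carey on another, and the two Bachman spellings stay separate. Names
that contain commas would need proper CSV quoting before import.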

I know it's a tall order but does anyone have any ideas?
Thanx.
Dp.

FYI: Bachmann and Bachman are different people but I suspect William 
D. is also Bill Bachman. 
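
Cases like these are why I expect to check near hits by hand rather than merge
them automatically. One possible way to flag candidates, sketched here on the
assumption that a plain Levenshtein edit distance is a reasonable measure of
"spelt slightly differently": the names edit_distance and near_pairs and the
threshold of 2 are my own choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance between two strings, keeping one row at a time.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @curr = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,              # deletion
                $curr[$j - 1] + 1,          # insertion
                $prev[$j - 1] + $cost,      # substitution or match
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Hypothetical helper: every pair of names within $max edits of each other,
# compared case-insensitively. Each pair is returned as an array ref.
sub near_pairs {
    my ($max, @names) = @_;
    my @pairs;
    for my $i (0 .. $#names - 1) {
        for my $j ($i + 1 .. $#names) {
            push @pairs, [ $names[$i], $names[$j] ]
                if edit_distance(lc($names[$i]), lc($names[$j])) <= $max;
        }
    }
    return @pairs;
}

my @sample = ('Bill Bachman', 'Bill Bachmann', 'William D. Bachman',
              'Jack Fields', 'Jack Rosen');
print "$_->[0] <-> $_->[1]\n" for near_pairs(2, @sample);
```

On that sample it flags only Bill Bachman and Bill Bachmann, who are different
people, and it does not connect William D. Bachman with Bill Bachman at all,
so the output could only ever be a list for manual review.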

=== Sample data ==========
Alan and Sandy Carey
Alan & Sandy Carey
Alan Carey
Alan D. Carey
Leonard Lee Rue III 
Leonard Lessin 
"Leonard Lessin, FBPA "
Bill Bachman 
Bill Bachmann
William D. Bachman
Fred McConnaughey 
Frederica Georgia 
Frederick Ayer III 
Frederick R. McConnaughey
Greg Dimijian 
Gregory G. Dimijian 
"Gregory G. Dimijian, M.D. "  
Herve Donnezan 
Howard Uible 
Hubertus Kanus 
Inger McCabe Elliott 
Irene Vandermolen 
J. Gerard Smith 
J. L. G. Grande 
J. Water and A. Salic 
J. Waters and A. Salic 
Jack Fields 
Jack Rosen 
Daniel Bernstein 
Dan Bernstein 
David Schleser 
Kees van den Berg 
Kees Van Den Berg 
