Hi,
I have a list of about 650 names (a small sample is below) that I
need to import into a database. When you look at the list there are
some obvious duplicates that are spelt slightly differently. I can
rationalize some of the data with some simple substitutions but some
of the data looks almost impossible to parse programmatically.
Here is what I have done so far; it's not much:
#!/bin/perl
use strict;
use warnings;

my $file = "myfile.csv";
open(my $fh, '<', $file) or die "Can't open $file: $!\n";
while (<$fh>) {
    chomp;
    s/&/and/g;        # change every & to "and"
    s/"//g;           # remove any quotes
    s/\s+$//;         # remove all trailing white space
    s{\s*/\s*}{/}g;   # remove any space on either side of a slash
    s/,$//;           # remove a trailing comma
    print "$_\n";
}
close($fh);
Are there any other techniques that I can use to help standardise the
list? I know I am going to have to look at the list manually and sort
it, but I thought there might be some way to give myself a head start.
If possible, I would like to generate a CSV file in which the first
field contains the first appearance of a name and any near hits
appear in the second and third fields, e.g.:
"Alan and Sandy Carey, Alan & Sandy Carey\n"
"Alan Carey, Alan D Carey\n"
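To make the idea concrete, here is a rough sketch of the grouping I have in mind. It assumes that two names are "near hits" when they match after lower-casing, treating & as "and", dropping full stops, quotes and anything after a comma, and ignoring middle initials. The subroutine names name_key and group_names are ones I made up for illustration, and the matching rule is only a guess at what would be useful:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a comparison key; names that share a key are treated as near hits.
sub name_key {
    my ($name) = @_;
    my $key = lc $name;
    $key =~ s/,.*$//;       # drop anything after a comma (e.g. ", M.D.")
    $key =~ s/&/ and /g;    # treat "&" and "and" alike
    $key =~ s/[."]//g;      # drop full stops and quotes
    my @words = split ' ', $key;
    # keep the first word, drop later single-letter words (middle initials)
    my @kept = grep { $_ == 0 || length($words[$_]) > 1 } 0 .. $#words;
    return join ' ', @words[@kept];
}

# Group names by key, keeping keys in order of first appearance.
# Returns a list of array references, one per group.
sub group_names {
    my @names = @_;
    my (%group, @order);
    for my $name (@names) {
        my $key = name_key($name);
        push @order, $key unless exists $group{$key};
        push @{ $group{$key} }, $name;
    }
    return @group{@order};
}

my @sample = (
    'Alan and Sandy Carey', 'Alan & Sandy Carey',
    'Alan Carey',           'Alan D. Carey',
    'Bill Bachman',         'Bill Bachmann',
    'Kees van den Berg',    'Kees Van Den Berg',
);
print join(', ', @$_), "\n" for group_names(@sample);
# prints:
# Alan and Sandy Carey, Alan & Sandy Carey
# Alan Carey, Alan D. Carey
# Bill Bachman
# Bill Bachmann
# Kees van den Berg, Kees Van Den Berg
```

Joining with a plain comma would break on names that contain commas themselves, so for the real file a proper CSV writer such as the Text::CSV module from CPAN would probably be safer.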
I know it's a tall order but does anyone have any ideas?
Thanx.
Dp.
FYI: Bachmann and Bachman are different people but I suspect William
D. is also Bill Bachman.
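For cases like Bill/William, I assume I would need a small hand-maintained table of short forms. The sketch below is only a guess at how that might look: the %full_name table and the person_key subroutine are hypothetical, and the entries are just the pairs I spotted in the sample. It reduces each name to "first-name surname" so that likely matches can be flagged for manual checking, while Bachman and Bachmann stay apart:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, hand-maintained table: short form => full form.
my %full_name = (
    bill => 'william',
    fred => 'frederick',
    greg => 'gregory',
    dan  => 'daniel',
);

# Reduce a name to "first-name surname", expanding known short forms.
# Returns '' for empty input and for joint names such as "A and B Smith".
sub person_key {
    my ($name) = @_;
    my $clean = lc $name;
    $clean =~ s/,.*$//;      # drop trailing titles such as ", M.D."
    $clean =~ s/[."]//g;     # drop full stops and quotes
    return '' if $clean =~ /\s(?:and|&)\s/;   # skip joint names
    my @words = split ' ', $clean;
    return '' unless @words;
    my $first = exists $full_name{ $words[0] }
              ? $full_name{ $words[0] }
              : $words[0];
    return "$first $words[-1]";
}

# Show the key next to each original name.
for my $name ('Bill Bachman', 'Bill Bachmann', 'William D. Bachman') {
    printf "%-20s => %s\n", $name, person_key($name);
}
```

Here Bill Bachman and William D. Bachman end up with the same key and Bill Bachmann with a different one. A shared key is only a hint that two entries might be the same person, so I would still check each flagged pair by hand.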
=== Sample data ==========
Alan and Sandy Carey
Alan & Sandy Carey
Alan Carey
Alan D. Carey
Leonard Lee Rue III
Leonard Lessin
"Leonard Lessin, FBPA "
Bill Bachman
Bill Bachmann
William D. Bachman
Fred McConnaughey
Frederica Georgia
Frederick Ayer III
Frederick R. McConnaughey
Greg Dimijian
Gregory G. Dimijian
"Gregory G. Dimijian, M.D. "
Herve Donnezan
Howard Uible
Hubertus Kanus
Inger McCabe Elliott
Irene Vandermolen
J. Gerard Smith
J. L. G. Grande
J. Water and A. Salic
J. Waters and A. Salic
Jack Fields
Jack Rosen
Daniel Bernstein
Dan Bernstein
David Schleser
Kees van den Berg
Kees Van Den Berg