Beginner wrote:
Hi,
I am trying to extract the iso code and country name from a 3 column
table (taken from en.wikipedia.org) and have noticed a problem with
accented characters such as Ô.
Below is my script and a sample of the data I am using. When I run
the script, the line beginning CI for Côte d'Ivoire returns the string
"CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire".
Does anyone know why \w+ does not match all of Côte d'Ivoire, and how I
can get around it in future?
TIA,
Dp.
==== extract.pl ========
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'iso-alpha2.txt';
open(FH,$file) or die "Can't open $file: $!\n";
while (<FH>) {
    chomp;
    next if ($_ !~ /^\w{2}\s+/);
    my ($code,$name) = ($_ =~
        /^(\w{2})\s+(\w+\s\w+\s\w+\s\w+|\w+\s\w+\s\w+|\w+\s\w+|\w+)/);
    print "$code\t$name\n";
}
===============
======== sample data ========
...snip
BY Belarus Previously named "Byelorussian S.S.R."
BZ Belize
CA Canada
CC Cocos (Keeling) Islands
CD Congo, the Democratic Republic of the Previously named "Zaire"
ZR
CF Central African Republic
CG Congo
CH Switzerland Code taken from "Confoederatio Helvetica", its official
Latin name
CI Côte d'Ivoire
CK Cook Islands
CL Chile
CM Cameroon
===========
Ordinarily the set of characters matched by \w is limited to [0-9A-Za-z_].
However, if you put 'use locale' at the start of your program, and your
locale treats accented letters as alphabetic, \w will match those
characters as well (see perldoc perllocale).
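To see the default behaviour concretely, here is a minimal sketch. It does not use 'use locale', whose effect depends on the locales installed on your system. Instead it assumes the data is UTF-8 and compares the raw bytes with the same string decoded through the core Encode module, which is a different technique from the one described above. The sub name first_word is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the leading run of \w characters.
sub first_word {
    my ($text) = @_;
    my ($word) = $text =~ /^(\w+)/;
    return $word;
}

# "Côte d'Ivoire" as raw UTF-8 bytes (assumption: the file is UTF-8).
my $bytes = "C\xC3\xB4te d'Ivoire";

# The same text decoded into Perl characters.
my $chars = decode('UTF-8', $bytes);

my $from_bytes = first_word($bytes);   # \w stops at the first non-ASCII byte
my $from_chars = first_word($chars);   # \w also matches the accented letter

printf "bytes: %d character(s) matched\n", length $from_bytes;
printf "chars: %d character(s) matched\n", length $from_chars;
```

If you go the decoding route you would normally also set an output encoding before printing the names themselves; the sketch only prints lengths to sidestep that.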
However, this will still not solve your problem, as the apostrophe in
"Côte d'Ivoire" will still not match \w and you will end up with
"CI\tCôte d". I suggest you change your regex to simply match any
character at all up to the end of the line, like this:
while (<FH>) {
    chomp;
    next unless /^(\w\w)\s+(.+?)\s*$/;
    my ($code, $name) = ($1, $2);
    print "$code\t$name\n";
}
which will give the result you desire.
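As a quick check, the same regex can be wrapped in a small sub and run against a few of the sample lines. The sub name parse_line and the hard-coded lines are only for illustration, and the tab after each code is an assumption about how the columns are separated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the regex above: returns (code, name),
# or an empty list if the line does not match.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /^(\w\w)\s+(.+?)\s*$/;
    return ($1, $2);
}

# A few lines from the sample data (tab-separated here by assumption).
my @lines = (
    "CA\tCanada\n",
    "CI\tCôte d'Ivoire\n",
    "CK\tCook Islands\n",
);

for my $line (@lines) {
    my ($code, $name) = parse_line($line) or next;
    print "$code\t$name\n";
}
```

Because (.+?) takes everything up to the end of the line, any text in a third column on the same line would end up in the name as well.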
But you still have the problem that the line for Zaire, which contains
only the code ZR, has no name text and so will not match the regex at all!
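One possible way to cope with such lines is to make the name optional. The sketch below is not part of the script above: it assumes the three columns are separated by tabs (the sample as posted does not show this), it accepts only two uppercase letters as a code rather than \w\w, and the sub name parse_row is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a tab-separated row into (code, name).
# A row with a code but no name yields an empty name; anything that
# does not start with a two-letter code yields an empty list.
sub parse_row {
    my ($line) = @_;
    chomp $line;
    my ($code, $name) = split /\t/, $line, 3;
    return unless defined $code && $code =~ /^[A-Z]{2}$/;
    $name = '' unless defined $name;
    return ($code, $name);
}

my @rows = (
    "CD\tCongo, the Democratic Republic of the\tPreviously named \"Zaire\"\n",
    "ZR\n",
    "Latin name\n",
);

for my $row (@rows) {
    my ($code, $name) = parse_row($row) or next;
    print "$code\t$name\n";
}
```

Splitting on the tab also keeps the notes column out of the name, and wrapped continuation lines such as "Latin name" are skipped because they do not begin with a two-letter code.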
Hope this helps.
Rob