On 03/27/2007 03:34 AM, Beginner wrote:
Hi,
I am trying to extract the iso code and country name from a 3 column
table (taken from en.wikipedia.org) and have noticed a problem with
accented characters such as Ô.
Below is my script and a sample of the data I am using. When I run
the script the code beginning CI for Côte d'Ivoire returns the string
"CI\tC" whereas I had hoped for "CI\tCôte d'Ivoire"
Does anyone know why \w+ does not match all of Côte d'Ivoire and how I can get
around it in future?
TIA,
Dp.
==== extract.pl ========
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'iso-alpha2.txt';
open(FH,$file) or die "Can't open $file: $!\n";
while (<FH>) {
chomp;
next if ($_ !~ /^\w{2}\s+/);
my ($code,$name) = ($_ =~
/^(\w{2})\s+(\w+\s\w+\s\w+s\w+|\w+\s\w+\s\w+|\w+\s\w+|\w+)/);
print "$code\t$name\n";
}
===============
======== sample data ========
...snip
BY Belarus Previously named "Byelorussian S.S.R."
BZ Belize
CA Canada
CC Cocos (Keeling) Islands
CD Congo, the Democratic Republic of the Previously named "Zaire"
ZR
CF Central African Republic
CG Congo
CH Switzerland Code taken from "Confoederatio Helvetica", its
official Latin name
CI Côte d'Ivoire
CK Cook Islands
CL Chile
CM Cameroon
===========
It's partly the encoding. Put «use encoding "iso-8859-1";» at the top of
your program, and there will be a little improvement, because \w will then
match the ô. However, that only gets you as far as "Côte d": the apostrophe
is punctuation, not a word character, so \w will not match it in any
encoding.
It's probably best to create an expression that contains all of the
characters you may want. That would include accented characters and the
apostrophe in this case.
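As a rough sketch of that idea (not a drop-in replacement for your script), the
pattern below swaps \w for a character class built from \p{L}, which matches
any Unicode letter, plus the apostrophe and the hyphen. The name parse_line,
the exact set of extra characters, and the limit of four words per name are
assumptions made for illustration; adjust them to your data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One "word" of a country name: Unicode letters (\p{L} covers accented
# letters such as the o-circumflex), plus apostrophe and hyphen.
# This set is an assumption; extend it if your data needs more characters.
my $word = qr/[\p{L}'-]+/;

# Hypothetical helper: returns (code, name) for a data line, or an empty
# list when the line does not start with a two-character code and a name.
# Like the original pattern, it keeps at most four words of the name.
sub parse_line {
    my ($line) = @_;
    return $line =~ /^(\w{2})\s+($word(?:\s$word){0,3})/ ? ($1, $2) : ();
}

# \x{F4} is the o-circumflex, written as an escape so that this example
# does not depend on the encoding of the source file.
my ($code, $name) = parse_line("CI C\x{F4}te d'Ivoire");

binmode STDOUT, ':encoding(UTF-8)';   # assumes a UTF-8 terminal
print "$code\t$name\n";
```

This only works on decoded text, so when reading from a file you would open it
with an encoding layer, for example open(my $fh, '<:encoding(iso-8859-1)', $file),
using whichever encoding the file is really in.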
Also, I advise you to use a programmer's editor that supports syntax
highlighting. Vim shows me that you missed the backslash that is
supposed to be on the fourth "\s" in your regular expression: the first
alternative ends in \w+s\w+ where \w+\s\w+ was intended.