Re: Regex problem with accented characters

Rob Dixon Tue, 27 Mar 2007 04:33:05 -0800

Beginner wrote:

Hi,
I am trying to extract the ISO code and country name from a 3-column table (taken from en.wikipedia.org) and have noticed a problem with accented characters such as Ô.
Below is my script and a sample of the data I am using. When I run the script, the code beginning CI for Côte d'Ivoire returns the string
"CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire".
Does anyone know why \w+ does not include Côte d'Ivoire and how I can get around it in future?
TIA,
Dp.


==== extract.pl ========
#!/usr/bin/perl

use strict;
use warnings;

my $file = 'iso-alpha2.txt';

open(FH,$file) or die "Can't open $file: $!\n";
while (<FH>) {
        chomp;
        next if ($_ !~ /^\w{2}\s+/);
        my ($code,$name) = ($_ =~
/^(\w{2})\s+(\w+\s\w+\s\w+\s\w+|\w+\s\w+\s\w+|\w+\s\w+|\w+)/);
        print "$code\t$name\n";
}
===============

======== sample data ========
...snip
BY      Belarus         Previously named "Byelorussian S.S.R."
BZ      Belize  
CA      Canada  
CC      Cocos (Keeling) Islands         
CD Congo, the Democratic Republic of the Previously named "Zaire"ZR
CF      Central African Republic        
CG      Congo   
CH      Switzerland     Code taken from "Confoederatio Helvetica", its official 
Latin name
CI      Côte d'Ivoire   
CK      Cook Islands    
CL      Chile   
CM      Cameroon
===========


Ordinarily the range of characters matched by \w is limited to [0-9A-Za-z_].
If you put 'use locale' at the start of your program, this is extended to
include the accented alphabetic characters of your current locale (see
perldoc perllocale).
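
The default behaviour can be reproduced without depending on any particular
locale. The sketch below is only an illustration and assumes the data file is
UTF-8 encoded, which the original post does not state. It matches \w+ against
the raw bytes and then against the same text decoded with the core Encode
module. Decoding is a different technique from 'use locale' and is shown only
for comparison.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "Côte d'Ivoire", as read from a file with no I/O layer.
# (Assumption: the file is UTF-8; the original post does not say.)
my $bytes = "C\xC3\xB4te d'Ivoire";

# On a byte string, \w uses ASCII rules for bytes above 0x7F,
# so the match stops right after the "C".
my ($from_bytes) = $bytes =~ /^(\w+)/;

# After decoding, the string holds characters rather than bytes,
# and \w recognises the accented letter as a word character.
my $chars = decode('UTF-8', $bytes);
my ($from_chars) = $chars =~ /^(\w+)/;

binmode STDOUT, ':encoding(UTF-8)';
print "bytes: $from_bytes\n";   # bytes: C
print "chars: $from_chars\n";   # chars: Côte
```

Either way the match ends at the space, so a single \w+ never reaches the
second word of the name.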

Even so, this will not solve your problem: the apostrophe in
"Côte d'Ivoire" does not match \w, so you will end up with
"CI\tCôte d". I suggest you change your regex to simply match any
character at all up to the end of the line, like this:

 while (<FH>) {
   chomp;
   next unless /^(\w\w)\s+(.+?)\s*$/;
   my ($code, $name) = ($1, $2);
   print "$code\t$name\n";
 }

which will give the result you desire.
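
One thing to watch: because (.+?) runs to the end of the line, any text in the
third column (such as the note on the Belarus line) is captured along with the
name. If only the name is wanted, one possible variation is sketched below. It
assumes that columns are separated by a tab or by at least two spaces, which is
a guess based on the sample data rather than something stated in the post, and
it would truncate a name that itself contained two consecutive spaces.

```perl
use strict;
use warnings;

# A few lines modelled on the sample data (the column separator is assumed).
my @lines = (
    qq{BY      Belarus         Previously named "Byelorussian S.S.R."},
    qq{CA      Canada  },
    qq{CI      C\xC3\xB4te d'Ivoire   },   # UTF-8 bytes for "Côte d'Ivoire"
    qq{Latin name},                        # wrapped continuation line, skipped
);

my @rows;
for my $line (@lines) {
    # Stop the name at a tab, at a run of two or more whitespace
    # characters, or at the end of the line, whichever comes first.
    next unless $line =~ /^(\w\w)\s+(.+?)(?:\t|\s{2,}|\s*$)/;
    push @rows, "$1\t$2";
}

print "$_\n" for @rows;
```

With these four lines it prints one code and name per matching line and skips
the continuation line, since "Latin" has no whitespace after its first two
characters.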

But you still have the problem that the line for Zaire has no text and
will not match the regex anyway!

Hope this helps.

Rob
