Re: Regex problem with accented characters

Mumia W. Tue, 27 Mar 2007 04:24:18 -0800

On 03/27/2007 03:34 AM, Beginner wrote:

Hi,
I am trying to extract the ISO code and country name from a 3-column table (taken from en.wikipedia.org) and have noticed a problem with accented characters such as Ô.
Below is my script and a sample of the data I am using. When I run the script, the code beginning CI for Côte d'Ivoire returns the string
"CI\tC" whereas I had hoped for "CI\tCôte d'Ivoire".
Does anyone know why \w+ does not include Côte d'Ivoire and how I can get around it in future?
TIA,
Dp.


==== extract.pl ========
#!/usr/bin/perl

use strict;
use warnings;

my $file = 'iso-alpha2.txt';

open(FH,$file) or die "Can't open $file: $!\n";
while (<FH>) {
        chomp;
        next if ($_ !~ /^\w{2}\s+/);
        my ($code,$name) = ($_ =~/^(\w{2})\s+(\w+\s\w+\s\w+s\w+|\w+\s\w+\s\w+|\w+\s\w+|\w+)/);
        print "$code\t$name\n";
}
===============

======== sample data ========
...snip
BY      Belarus         Previously named "Byelorussian S.S.R."
BZ      Belize  
CA      Canada  
CC      Cocos (Keeling) Islands         
CD Congo, the Democratic Republic of the Previously named "Zaire"ZR
CF      Central African Republic        
CG      Congo   
CH Switzerland Code taken from "Confoederatio Helvetica", its official Latin name
CI      Côte d'Ivoire   
CK      Cook Islands    
CL      Chile   
CM      Cameroon
===========

It's partly the encoding. Put «use encoding "iso-8859-1";» at the top of your program, and there will be a little improvement. However, that only gets you as far as "Côte d"; I doubt there is any encoding where the apostrophe is in \w.
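To make the encoding point concrete, here is a minimal sketch. It assumes the data really is ISO-8859-1, and it decodes the bytes explicitly with the core Encode module instead of using the pragma (which, as far as I know, was deprecated in later Perl releases). The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they would arrive from an ISO-8859-1 file (0xF4 is "ô").
my $bytes = "CI\tC\xF4te d'Ivoire";

# Decode the bytes into Perl characters so that \w sees "ô" as a letter.
my $text = decode('ISO-8859-1', $bytes);

# On the undecoded bytes, \w+ stops at the byte 0xF4.
my ($raw_word) = $bytes =~ /\t(\w+)/;

# On the decoded text, \w+ runs through "ô" and stops at the space.
my ($decoded_word) = $text =~ /\t(\w+)/;
```

Even after decoding, \w+ never crosses the apostrophe in "d'Ivoire", which is why fixing the encoding alone is not enough.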

It's probably best to create an expression that contains all of the characters you may want. That would include accented characters and the apostrophe in this case.
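One possible shape for such an expression is sketched below. It is not a drop-in replacement for the script: parse_row is a hypothetical helper, the set of punctuation is guessed from the sample data, and it assumes the input has already been decoded to characters and that the columns are separated by a tab or by more than one space.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are written in UTF-8

# One "word" of a country name: any Unicode letter plus the punctuation
# seen in the sample data (apostrophe, parentheses, comma, hyphen).
my $word = qr/[\p{L}'(),-]+/;

# Hypothetical helper: returns (code, name), or an empty list when the
# line does not look like a table row.
sub parse_row {
    my ($line) = @_;
    # Words are joined by single spaces, so the match stops at the tab
    # or run of spaces that separates the columns.
    return unless $line =~ /^(\w{2})\s+($word(?: $word)*)/;
    return ($1, $2);
}

binmode STDOUT, ':encoding(UTF-8)';
for my $line ("CI      Côte d'Ivoire   ", "CC      Cocos (Keeling) Islands") {
    my ($code, $name) = parse_row($line) or next;
    print "$code\t$name\n";
}
```

Rows where the remarks column follows the name after only a single space would still run together, so those lines need a different separator rule.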

Also, I advise you to use a programmer's editor that supports syntax highlighting. My VIM shows me that you missed the backslash that is supposed to be on the fourth "\s" in your regular expression.



