Re: \w regular expressions unicode

Gunnar Hjalmarsson Wed, 22 Apr 2009 09:20:30 -0700

Stanisław T. Findeisen wrote:

Gunnar Hjalmarsson wrote:
Stanisław T. Findeisen wrote:
Hi, how do I write regular expressions that match against Unicode (e.g., UTF-8) strings?
For instance, in my regexp:

qr/^([.<>@ \w])*$/
Decode the UTF-8 encoded strings before applying the regex on them.

$ perl -MEncode -le '
$utf8_encoded = "smörgåsbord";
$s = decode "UTF-8", $utf8_encoded;
print "Match" if $s =~ /^\w+$/;
'
Match
$
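
To see why the decode step matters, here is a minimal sketch that runs the same regex against the raw bytes and against the decoded string. The variable names are only illustrative, and the bytes are written as hex escapes so that the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 encoded bytes of "smörgåsbord" (ö = C3 B6, å = C3 A5)
my $utf8_encoded = "sm\xc3\xb6rg\xc3\xa5sbord";

# Undecoded: the regex sees individual bytes, and bytes such as 0xB6
# are not word characters, so \w+ cannot span the whole string.
my $raw_match = ($utf8_encoded =~ /^\w+$/) ? "Match" : "No match";

# Decoded: the regex sees characters, and ö and å are word characters.
my $decoded       = decode("UTF-8", $utf8_encoded);
my $decoded_match = ($decoded =~ /^\w+$/) ? "Match" : "No match";

print "raw bytes: $raw_match\n";       # raw bytes: No match
print "decoded:   $decoded_match\n";   # decoded:   Match
```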
Thanks, decode helped with this. But can I ask you one more question? What assumptions does Perl make regarding input file (i.e., the program/script file) encoding?

AFAIK, it just converts the bytes into Perl's internal format, but it does not assume anything (at least not by default) with respect to the character encoding.

Is it so that string literals in Perl are byte arrays in fact?


String literals in a Perl script are byte *strings* until decoded.
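
A rough way to observe the difference is to compare length() before and after decoding. This is only a sketch: the variable names are made up for illustration, and the literal is spelled with hex escapes to stand in for what a UTF-8 saved script file would contain.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bytes a UTF-8 saved script would hold for the literal "smörgåsbord"
my $bytes = "sm\xc3\xb6rg\xc3\xa5sbord";

# After decoding, each of ö and å is one character instead of two bytes
my $chars = decode("UTF-8", $bytes);

printf "byte string length:      %d\n", length $bytes;   # 13
printf "character string length: %d\n", length $chars;   # 11
```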

What you type is what you get?


Not sure what you mean by that.

You may find http://perldoc.perl.org/perlunitut.html helpful.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
