On Fri, Feb 15, 2002 at 01:21:33PM -0500, John A. Walsh wrote:
> Hello,
>
> I can't get character classes in regular expressions to work with
> Unicode characters. I've tried putting both the literal Unicode
> characters and the \x{XX} notation within square brackets [] to create
> a character class, but it's not working. I've tried with both the
> developer release of Perl 5.7.2 and the daily build from 2002/02/13.
>
> Here's an example of some code that isn't working for me:
> ---
> #!/usr/local/bin/perl5.7.2
> use Encode;
> use utf8;

Rule #1: Do not use "use utf8". It's irrelevant.

Amendment: "use utf8" is useful in one case and one case only: if your
*script* is in UTF-8, you can say "use utf8" and then use UTF-8 in
places like variable and subroutine names. (Now I'm talking about Perl
5.7. In Perl 5.6 it was different.)

> $string = encode_utf8("f\x{e9}lise");

encode_utf8() will correctly transform the \x{e9} into the UTF-8 bytes
\x{c3}\x{a9}.

> $string =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/; #does not match

It does not match because you no longer have the byte \x{e9} in your
$string; you have its UTF-8 bytes \x{c3}\x{a9}. A character class
matches a single character, and the character after "f" is now \x{c3},
which is not in the class.

> print "new string: $string\n";
> ---
>
> With another approach, this works:
>
> #!/usr/local/bin/perl5.7.2
> use Encode;
> use utf8;
>
> $string = encode_utf8("f\x{e9}lise");
> $regex = encode_utf8("f\x{e9}lise");
> $string =~ s/$regex/SUCCESS/; #matches

This works because now the byte sequences match.

> print "new string: $string\n";
>
> While this does not:
>
> #!/usr/local/bin/perl5.7.2
> use Encode;
> use utf8;
>
> $string = encode_utf8("f\x{e9}lise");
> $regex = encode_utf8("f[\x{e9}\x{e8}]lise");

You shouldn't convert regular expressions with encode_utf8(). What
happens now is that the character class in the $regex gets to contain
three bytes: \x{c3} (twice), \x{a9}, and \x{a8}. The class still matches
only one byte, so it cannot match the two-byte sequence \x{c3}\x{a9}.

> $string =~ s/$regex/SUCCESS/; #does not match
> print "new string: $string\n";
>
> Should examples 1 and 3 be working? Thanks for listening.

In all three examples you weren't actually using Unicode from Perl's
perspective. You were converting 8-bit encoding bytes to UTF-8 bytes.

You can take a peek at "perluniintro", which is a new document (after
5.7.2), hopefully clarifying things a bit.

http://www.iki.fi/jhi/perluniintro.pod

Some of the features it talks about only work in post-5.7.2 Perl, but
most of the 'theory' should be applicable to 5.7.2.

> John
> | John A. Walsh, Manager, Electronic Text Technologies
> | Digital Library Program / University Information Technology Services (UITS)
> | Indiana University, 1320 East Tenth Street, Bloomington, IN 47405
> | Voice:812-855-8758 Fax:812-856-2062 <mailto:[EMAIL PROTECTED]>

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
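
For illustration, here is a minimal sketch of the distinction drawn above between a character string and its UTF-8 bytes. It is not taken from the original messages: the variable names are made up, and it assumes a Perl recent enough that Encode exports encode_utf8() and decode_utf8() (the 5.7.2 snapshot may differ).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Character string: \x{e9} is a single character (e with acute accent).
my $chars = "f\x{e9}lise";

# Byte string: the same text as UTF-8, where \x{e9} becomes \xc3 \xa9.
my $bytes = encode_utf8($chars);

# The character class is written with plain code points, not encoded bytes.
my $re = qr/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/;

# One character after "f", and it is in the class: matches.
my $chars_match = $chars =~ $re ? 1 : 0;

# Two bytes after "f", and the first (\xc3) is not in the class: no match.
my $bytes_match = $bytes =~ $re ? 1 : 0;

# Decoding the bytes back into characters makes the class match again.
my $decoded_match = decode_utf8($bytes) =~ $re ? 1 : 0;

printf "characters: %d, bytes: %d, decoded: %d\n",
    $chars_match, $bytes_match, $decoded_match;
```

On a current Perl this prints "characters: 1, bytes: 0, decoded: 1". The sketch leaves the regular expression unencoded and matches it against character data, decoding any UTF-8 input first.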