Re: Character classes with Unicode

Jarkko Hietaniemi Fri, 15 Feb 2002 11:20:38 -0800

On Fri, Feb 15, 2002 at 01:21:33PM -0500, John A.Walsh wrote:
> Hello,
> 
> I can't get character classes in regular expressions to work with
> Unicode characters.  I've tried putting both the literal Unicode
> characters and the \x{XX} notation within square brackets [] to create
> a character class, but it's not working.  I've tried with both the
> developer release of Perl 5.7.2 and the daily build from 2002/02/13.
> 
> Here's an example of some code that isn't working for me:
> ---
> #!/usr/local/bin/perl5.7.2
> use Encode;
> use utf8;


Rule #1: Do not use "use utf8".  It's irrelevant.

        Amendment: "use utf8" is useful in one case and one case only--
        if your *script* is in UTF-8, you can say "use utf8" and then
        use UTF-8 in places like variable and subroutine names.

        (Now I'm talking Perl 5.7.  In Perl 5.6 it was different.)

> $string = encode_utf8("f\x{e9}lise");

encode_utf8() will correctly transform the \x{e9} into the UTF-8 bytes
\x{c3}\x{a9}.
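
A minimal sketch of that transformation, assuming a current Encode
module (the variable names $chars, $bytes and $hex are only for
illustration, and the Encode shipped with 5.7.2 may behave differently):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $chars = "f\x{e9}lise";          # six characters, one of them e-acute
my $bytes = encode_utf8($chars);    # seven bytes: e-acute becomes C3 A9

# Dump every byte of the encoded string as two hex digits.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;

print length($chars), " chars, ", length($bytes), " bytes: $hex\n";
# prints "6 chars, 7 bytes: 66 c3 a9 6c 69 73 65"
```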

> $string =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/; #does not match

It does not, because you no longer have the byte \x{e9} in your $string;
you have its UTF-8 bytes \x{c3}\x{a9}.
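
For comparison, here is a sketch of the same substitution with the
encode_utf8() call simply left out, so that the string still holds one
character per letter.  The "use strict", "my" and $result are my
additions for illustration, not part of your script:

```perl
use strict;
use warnings;

my $string = "f\x{e9}lise";     # not encoded: \x{e9} is a single character

# Work on a copy so the original string stays available for inspection.
(my $result = $string) =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/;

print "new string: $result\n";  # prints "new string: SUCCESS"
```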

> print "new string: $string\n";
> ---
> 
> With another approach, this works:
> 
> #!/usr/local/bin/perl5.7.2
> use Encode;
> use utf8;
> 
> $string = encode_utf8("f\x{e9}lise");
> $regex = encode_utf8("f\x{e9}lise");
> $string =~ s/$regex/SUCCESS/; #matches

This works because now the byte sequences match.

> print "new string: $string\n";
> 
> While this does not:
> 
> #!/usr/local/bin/perl5.7.2
> use Encode;
> use utf8;
> 
> $string = encode_utf8("f\x{e9}lise");
> $regex = encode_utf8("f[\x{e9}\x{e8}]lise");

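You shouldn't convert regular expressions with encode_utf8().
What happens now is that the character class in the $regex
ends up containing three distinct bytes: \x{c3} (listed twice),
\x{a9}, and \x{a8}.  A character class matches exactly one of its
members, so it can never match the two-byte sequence \x{c3}\x{a9}
that is in $string.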
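
To see this for yourself, the following sketch dumps the bytes of the
encoded pattern and then tries the match.  It assumes a current Encode
module; $hex and $matched are names I made up for the illustration:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $string = encode_utf8("f\x{e9}lise");
my $regex  = encode_utf8("f[\x{e9}\x{e8}]lise");

# Between 5b ("[") and 5d ("]") the class now holds c3 a9 c3 a8.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $regex;
print "pattern bytes: $hex\n";
# prints "pattern bytes: 66 5b c3 a9 c3 a8 5d 6c 69 73 65"

# The class consumes only the c3 byte; the a9 that follows is not "l".
my $matched = ($string =~ /$regex/) ? 1 : 0;
print "matched: $matched\n";    # prints "matched: 0"
```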

> $string =~ s/$regex/SUCCESS/; #does not match
> print "new string: $string\n";
> 
> Should examples 1 and 3 be working?  Thanks for listening.

In all three examples you weren't actually using Unicode from
Perl's perspective.  You were converting 8-bit encoding bytes
to UTF-8 bytes.
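
If your data really does arrive as UTF-8 bytes (from a file, say), one
way to get character semantics is to decode on the way in, match
against characters, and encode again on the way out.  This is only a
sketch under that assumption: it uses decode_utf8() and encode_utf8()
as found in a current Encode, which may not match what 5.7.2 ships,
and $input_bytes and $output_bytes are hypothetical names:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

my $input_bytes = "f\xc3\xa9lise";          # seven UTF-8 bytes, as read from outside
my $string      = decode_utf8($input_bytes); # six characters inside Perl

# The class is now compared against characters, not bytes.
$string =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/;

my $output_bytes = encode_utf8($string);     # back to bytes for output
print "new string: $output_bytes\n";         # prints "new string: SUCCESS"
```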

You can take a peek at "perluniintro", which is a new document
(after 5.7.2), hopefully clarifying things a bit.

http://www.iki.fi/jhi/perluniintro.pod

Some of the features it talks about only work in post-5.7.2 Perl, but
most of the 'theory' should be applicable to 5.7.2.

> John
> | John A. Walsh, Manager, Electronic Text Technologies
> | Digital Library Program / University Information Technology Services (UITS)
> | Indiana University, 1320 East Tenth Street, Bloomington, IN 47405
> | Voice:812-855-8758 Fax:812-856-2062 <mailto:[EMAIL PROTECTED]>

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
