On Thu, May 31, 2007 at 08:43:13PM +0100, Markus Kuhn wrote:
: Is there something special about $1 inside a s/.../.../ge expression
: that prevents the application of Encode::_utf8_on($1)?
:
: Seems so, since
:
:   $s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/$a = $1,Encode::_utf8_on($a),sprintf("&#x%02X;", ord($a))/ge;
:
: does the trick.
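
For reference, the workaround quoted above can be written out as a small self-contained script. This is only a sketch: the sub names ncr_for and utf8_bytes_to_ncr and the sample input are made up for illustration, and it assumes the input is a byte string holding well-formed UTF-8. The point it shows is that the match is copied into an ordinary scalar before Encode::_utf8_on() is applied, so the flag is set on the copy rather than on $1.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn one UTF-8 byte sequence into a hex
# numeric character reference.
sub ncr_for {
    my ($seq) = @_;             # copies the argument (here, $1) into a plain scalar
    Encode::_utf8_on($seq);     # flag the copy as UTF-8, leaving $1 alone
    return sprintf("&#x%02X;", ord($seq));
}

# Hypothetical wrapper: replace every 2-, 3- or 4-byte UTF-8 sequence
# in a byte string, using the same pattern as the quoted one-liner.
sub utf8_bytes_to_ncr {
    my ($s) = @_;
    $s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/ncr_for($1)/ge;
    return $s;
}

# Sample input (an assumption): the UTF-8 bytes for e-acute and the euro sign.
print utf8_bytes_to_ncr("caf\xc3\xa9 \xe2\x82\xac"), "\n";   # prints caf&#xE9; &#x20AC;
```

Passing $1 to a sub and unpacking it with my ($seq) = @_ makes the same kind of copy as the $a = $1 in the quoted one-liner, without relying on the global $a.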

Yes, in Perl 5 a magical variable like $1 is essentially a tied reference into the middle of another string, and not a real value in its own right. When you read its value it copies out the substring and ignores any flags you might have set on the original scalar variable, since it thinks $1 is a read-only variable. (And, in fact, assigning to $1 complains about what it sees as an attempt to modify a read-only variable, but _utf8_on() does not check whether the scalar is considered writeable.)

But if it didn't simply ignore the flag when copying out the value, you would have succeeded in setting the utf8 flag for *all* uses of $1 in your program, because Perl 5 has only one global $1 variable, which interrogates the "current match" every time you read it.

In theory this should all work better in Perl 6, where match variables are properly lexically scoped, and $1 is just an alias into the list of matches contained in the current match variable, so the identity of each match can be preserved. (Along with the fact that Perl 6 treats byte strings and character strings as fundamentally different types that must not be confused with each other.)

Larry

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/