Re: Conversion-free switching between binary and character strings in Perl

Larry Wall Thu, 31 May 2007 13:29:53 -0700

On Thu, May 31, 2007 at 08:43:13PM +0100, Markus Kuhn wrote:
: Is there something special about $1 inside a s/.../.../ge expression
: that prevents the application of Encode::_utf8_on($1)?
: 
: Seems so, since
: 
: $s =~
: s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/$a = $1,Encode::_utf8_on($a),sprintf("&#x%02X;", ord($a))/ge;
: 
: does the trick.


Yes, in Perl 5 a magical variable like $1 is essentially a tied
reference into the middle of another string, and not a real value
in its own right, so when you read its value it copies out the
substring and ignores any flags you might have set on the original
scalar variable, since it thinks $1 is a read-only variable.  (And,
in fact, assigning to $1 complains about what it sees as an attempt
to modify a read-only variable, but _utf8_on() is not checking to
see if the scalar is considered writeable.)  But if it didn't simply
ignore the flag when copying out the value, you would have succeeded
in setting the utf8 flag for *all* $1 in your program, because Perl 5
only has one global $1 variable that interrogates the "current match"
every time you read it.
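
A minimal sketch of the workaround quoted above, assuming the input is a
byte string holding well-formed UTF-8: copy $1 into an ordinary lexical
first, then set the flag on that copy.  The sub name utf8_bytes_to_ncr
and the variable $c are hypothetical names chosen for illustration; the
pattern and the sprintf format are the ones from the quoted message.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: replace each UTF-8 multibyte sequence in a byte
# string with a hexadecimal numeric character reference.
sub utf8_bytes_to_ncr {
    my ($s) = @_;
    $s =~ s{([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})}{
        my $c = $1;              # copy the bytes out of the magical $1
        Encode::_utf8_on($c);    # flag the private copy, not $1 itself
        sprintf("&#x%02X;", ord($c));
    }ge;
    return $s;
}

# Bytes for "cafe" with e-acute, a space, and the euro sign.
print utf8_bytes_to_ncr("caf\xc3\xa9 \xe2\x82\xac"), "\n";
```

Since _utf8_on() does no validation, this is only safe when the bytes
really are valid UTF-8; Encode::decode('UTF-8', $c) would be the checked
alternative, at the cost of an actual conversion step.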

In theory this should all work better in Perl 6, where match variables
are properly lexically scoped, and $1 is just an alias into the list of
matches contained in the current match variable, so the identity of
each match can be preserved.  (Along with the fact that Perl 6 treats
byte strings and character strings as fundamentally different types
that must not be confused with each other.)

Larry

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
