Re: Character (or byte?) escapes under utf8 pragma

Michael Ludwig Mon, 08 Mar 2010 06:55:58 -0800

Hi Aristotle,

Thanks for your answer, much appreciated! Please see my comments
inline.


On 07.03.2010 at 07:39, Aristotle Pagaltzis wrote:

> Perl does not distinguish between bytes and characters. It does
> distinguish between scalars that use a packed byte buffer for
> storage vs strings that use variable-width integer sequence for
> storage, but this is an implementation detail and does not mean
> anything in terms of semantics. Strings are simply strings in
> Perl. You cannot tell what kind of data they contain just by
> looking at them and the UTF8 flag doesn’t tell you either.

Okay. But unless I'm completely mistaken, you can at least know
by convention whether a string is supposed to contain characters
(the result of Encode::decode) or bytes (the result of
Encode::encode). With the utf8 pragma in scope, it seems to me
that my literal strings are supposed to contain characters, not
bytes.
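
To make concrete what I mean by "characters" versus "bytes", here is
a minimal sketch. It assumes the script file is saved as UTF-8 and
uses only the core Encode module; the variable names are made up for
illustration.

```perl
use strict;
use warnings;
use utf8;                       # string literals in this file are decoded to characters
use Encode qw(encode decode);

my $chars = "Zurück";                    # 6 characters under "use utf8"
my $bytes = encode('UTF-8', $chars);     # 7 bytes: "ü" becomes two bytes, 0xC3 0xBC
my $round = decode('UTF-8', $bytes);     # decoding yields the 6 characters again

printf "chars: %d, bytes: %d, round trip ok: %s\n",
    length $chars, length $bytes, ($round eq $chars ? 'yes' : 'no');
```

Both $chars and $bytes are ordinary Perl strings; what differs is
what I intend them to hold, and length reflects that.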

>    "\x{00a0}" does not map to utf8 at t.pl line 11.
>    <<\xA0Zurück
>    "\x{00a0}" does not map to utf8 at t.pl line 11.
>    <<\xA0Zurück
>    "\x{00a0}" does not map to utf8 at t.pl line 11.
>    <<\xA0Zurück
>    << Zurück
>    die now, somewhat counter-intuitively at t.pl line 15.
> 
> This is definitely a bug.

Good. It looked like one to me. Thanks for logging it with the
Perl maintainers.

However, it might already have been fixed for Perl 5.10.1 - at
least, ActiveState v5.10.1 produces what I think is a correct
result:


michael.lud...@nb-mludwig: ~/MiLu/dev/perl/Unicode > aperl nbsp.pl 
<< Zurück
<< Zurück
<< Zurück
<< Zurück

michael.lud...@nb-mludwig: ~/MiLu/dev/perl/Unicode > aperl -v

This is perl, v5.10.1 built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)


>> Am I mistaken in my expectation that while "\xa0" should be
>> a byte, "\x{a0}" and "\x{00a0}" should be characters? Note that
>> perlretut(1) seems to support this assumption:
>> 
>> Unicode characters in the range of 128-255 use two hexadecimal
>> digits with braces: \x{ab}. Note that this is different than
>> \xab, which is just a hexadecimal byte with no Unicode
>> significance.
>> 
>> http://perl.active-venture.com/pod/perlretut-morecharacter.html
>> 
>> But maybe this only refers to these escapes inside regular expressions.
> 
> The documentation appears to be wrong. Unfortunately a lot of the
> documentation of Perl itself is wrong or confused about Perl’s
> string model.

The documentation I referred to is outdated. Sorry for that.

>> What's your advice for handling this situation more elegantly?
> 
> Use the \U escape to indicate that you always mean a Unicode code
> point. Due to other quirks in how \U is implemented, it ends up
> not triggering the bug that \x would.


How would I use that? I only know about the U specifier for pack:

my $smiley = pack 'U', 0x263a;
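
For comparison, here is a sketch of the ways I know to get a single
code point into a string. As far as I know, \U inside a
double-quoted string uppercases the text that follows, so I am
assuming (not knowing) that the escape meant is \N{U+...}. The
variable names are only for illustration.

```perl
use strict;
use warnings;

my $from_pack = pack 'U', 0x263a;   # the pack form from above
my $from_chr  = chr 0x263a;         # chr with a code point above 255
my $from_x    = "\x{263a}";         # braced hex escape
my $from_n    = "\N{U+263A}";       # explicit Unicode code point escape (my assumption)

# Each string should be one character long with ordinal 0x263A.
for my $s ($from_pack, $from_chr, $from_x, $from_n) {
    printf "U+%04X length=%d\n", ord $s, length $s;
}
```

The interesting case for this thread is of course the range 128-255,
where the same four forms can be tried with 0xa0 instead of 0x263a.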

-- 
Michael.Ludwig (#) XING.com
