Hi Aristotle, thanks for your answer - much appreciated! Please see my comments inline.

On 7 March 2010 at 07:39, Aristotle Pagaltzis wrote:

> Perl does not distinguish between bytes and characters. It does
> distinguish between scalars that use a packed byte buffer for
> storage vs strings that use variable-width integer sequence for
> storage, but this is an implementation detail and does not mean
> anything in terms of semantics. Strings are simply strings in
> Perl. You cannot tell what kind of data they contain just by
> looking at them and the UTF8 flag doesn’t tell you either.

Okay. But unless I'm completely misled, you can tell whether a string
is supposed to contain characters (<- Encode::decode) or bytes
(<- Encode::encode). With the utf8 pragma in scope, it seems to me
that my literal strings are supposed to contain characters, not bytes.

>   "\x{00a0}" does not map to utf8 at t.pl line 11.
>   <<\xA0Zurück
>   "\x{00a0}" does not map to utf8 at t.pl line 11.
>   <<\xA0Zurück
>   "\x{00a0}" does not map to utf8 at t.pl line 11.
>   <<\xA0Zurück
>   << Zurück
>   die now, somewhat counter-intuitively at t.pl line 15.
>
> This is definitely a bug.

Good. It looked like one to me. Thanks for logging it with the Perl
maintainers.

However, it might already have been fixed for Perl 5.10.1 - at least,
ActiveState v5.10.1 produces what I think is a correct result:

michael.lud...@nb-mludwig: ~/MiLu/dev/perl/Unicode > aperl nbsp.pl
<< Zurück
<< Zurück
<< Zurück
<< Zurück

michael.lud...@nb-mludwig: ~/MiLu/dev/perl/Unicode > aperl -v

This is perl, v5.10.1 built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

>> Am I mistaken in my expectation that while "\xa0" should be
>> a byte, "\x{a0}" and "\x{00a0}" should be characters? Note that
>> perlretut(1) seems to support this assumption:
>>
>>   Unicode characters in the range of 128-255 use two hexadecimal
>>   digits with braces: \x{ab}. Note that this is different than
>>   \xab, which is just a hexadecimal byte with no Unicode
>>   significance.
>>
>>   http://perl.active-venture.com/pod/perlretut-morecharacter.html
>>
>> But maybe this only refers to these escapes inside regular expressions.
>
> The documentation appears to be wrong. Unfortunately a lot of the
> documentation of Perl itself is wrong or confused about Perl’s
> string model.

The documentation I referred to is outdated. Sorry for that.

>> What's your advice for handling this situation more elegantly?
>
> Use the \U escape to indicate that you always mean a Unicode code
> point. Due to other quirks in how \U is implemented, it ends up
> not triggering the bug that \x would.

How would I use that? I only know about the U specifier for pack:

my $smiley = pack 'U', 0x263a;

--
Michael.Ludwig (#) XING.com