For convenience, I keep the source code of my test script in UTF-8.
The test also deals with non-breaking spaces, which I prefer
to keep as character references since they are not visible
and might be mistaken by the casual onlooker for ordinary
spaces. So I write them as "\xa0", "\x{a0}", or "\x{00a0}".

Now I find that they seem to be byte references, not character
references. Consider the following test script:


use strict;
use warnings;
use utf8; # source code in UTF-8 ("Zurück")
use open OUT => ':encoding(UTF-8)', ':std';

my $str1 = "<<\xa0Zurück\n";      # byte -> bad
my $str2 = "<<\x{a0}Zurück\n";    # should be character, but isn't
my $str3 = "<<\x{00a0}Zurück\n";  # ditto
my $str4 = "<<\xa0" . "Zurück\n"; # upgrading hack, works

print $str1, $str2, $str3, $str4;

$str1 ne $str2 and die "won't die";
$str1 ne $str3 and die "won't die";
$str1 ne $str4 and die 'die now, somewhat counter-intuitively';


The only variant that comes out right is $str4, which relies on
implicit upgrading of the byte escape "\xa0" to a Unicode character
when it is concatenated with the UTF-8 literal. I've read that
upgrading should rather be avoided, but here it does the job.
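
To see what Perl actually stores in each case, something like the
following might help. It is only a sketch: codepoints() is a helper
I made up for illustration, not part of any module, and I make no
claim about the output, since it may well depend on the Perl version.

```perl
use strict;
use warnings;
use utf8; # source code in UTF-8 ("Zurück")
use open OUT => ':encoding(UTF-8)', ':std';

# codepoints() is a hypothetical helper, not part of any module:
# it lists the code points of a string as U+XXXX.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $s;
}

# Same constructions as in the test script, minus the newline.
my @variants = (
    [ 'inline \xa0'   => "<<\xa0Zurück" ],
    [ 'inline \x{a0}' => "<<\x{a0}Zurück" ],
    [ 'concatenated'  => "<<\xa0" . "Zurück" ],
);

for my $v (@variants) {
    my ($name, $s) = @$v;
    # utf8::is_utf8() reports the internal UTF-8 flag of the scalar.
    printf "%-14s utf8 flag %-3s length %d: %s\n",
        $name, (utf8::is_utf8($s) ? 'on' : 'off'), length($s), codepoints($s);
}
```

If the inline variants and the concatenated one list different code
points, that would at least show where the strings diverge.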

Am I mistaken in my expectation that while "\xa0" should be a byte,
"\x{a0}" and "\x{00a0}" should be characters? Note that perlretut(1)
seems to support this assumption:

  Unicode characters in the range of 128-255 use two hexadecimal
  digits with braces: \x{ab}. Note that this is different than \xab,
  which is just a hexadecimal byte with no Unicode significance.

http://perl.active-venture.com/pod/perlretut-morecharacter.html

But maybe this only refers to these escapes inside regular expressions.

Or maybe the utf8 pragma breaks things here? Don't think so, though.
If I comment it out, I have to recode my script to Latin1 in order for
the strings to be valid.

Note that the reason I use the utf8 pragma is so I can write "Zurück"
in my source code and automatically have Perl informed that these are
characters, not bytes - which is a great convenience.

Yeah, it would also work in Latin1, and our editors handle various
encodings just fine - but we have a good UTF-8 development environment
and there might be characters not representable in Latin1 that I'd like
to add to the script source.

What's your advice for handling this situation more elegantly?

-- 
Michael.Ludwig (#) XING.com
