For convenience, I keep my test script's source code in UTF-8. The test also deals with non-breaking spaces, which I prefer to write as character references, since they are not visible and a casual onlooker might mistake them for ordinary spaces. So I write them as "\xa0", "\x{a0}", or "\x{00a0}".
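As a quick sanity check, here is a minimal sketch (not part of the actual test script) that looks at the three spellings in isolation. It assumes an ASCII-only file without the utf8 pragma, and the name @spellings is made up for illustration. I would expect each line to report a single character with code point 0xA0.

```perl
use strict;
use warnings;

# ASCII-only source, deliberately without "use utf8":
# each escape stands alone in its own string literal.
my @spellings = ("\xa0", "\x{a0}", "\x{00a0}");

for my $s (@spellings) {
    # length() counts characters; ord() gives the code point of the first one.
    printf "length=%d ord=0x%02X\n", length($s), ord($s);
}
```

This only shows what the escapes mean on their own; it says nothing about how they behave next to literal non-ASCII characters under the utf8 pragma.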
Now I find that they seem to be byte references, not character references. Consider the following test script:

    use strict;
    use warnings;
    use utf8; # source code in UTF-8 ("Zurück")
    use open OUT => ':encoding(UTF-8)', ':std';

    my $str1 = "<<\xa0Zurück\n";      # byte -> bad
    my $str2 = "<<\x{a0}Zurück\n";    # should be character, but isn't
    my $str3 = "<<\x{00a0}Zurück\n";  # ditto
    my $str4 = "<<\xa0" . "Zurück\n"; # upgrading hack, works

    print $str1, $str2, $str3, $str4;

    $str1 ne $str2 and die "won't die";
    $str1 ne $str3 and die "won't die";
    $str1 ne $str4 and die 'die now, somewhat counter-intuitively';

The correct version of the string uses implicit upgrading of the byte escape "\xa0" to a Unicode character. I've read upgrading should rather be avoided, but here it does the job.

Am I mistaken in my expectation that while "\xa0" should be a byte, "\x{a0}" and "\x{00a0}" should be characters?

Note that perlretut(1) seems to support this assumption:

    Unicode characters in the range of 128-255 use two hexadecimal
    digits with braces: \x{ab}. Note that this is different than
    \xab, which is just a hexadecimal byte with no Unicode
    significance.

    http://perl.active-venture.com/pod/perlretut-morecharacter.html

But maybe this only refers to these escapes inside regular expressions.

Or maybe the utf8 pragma breaks things here? Don't think so, though. If I comment it out, I have to recode my script to Latin1 in order for the strings to be valid.

Note that the reason I use the utf8 pragma is so I can write "Zurück" in my source code and automatically have Perl informed that these are characters, not bytes - which is a great convenience. Yeah, it would also work in Latin1, and our editors handle various encodings just fine - but we have a good UTF-8 development environment and there might be characters not representable in Latin1 that I'd like to add to the script source.

What's your advice for handling this situation more elegantly?

-- 
Michael.Ludwig (#) XING.com