On 10.03.2010 at 11:02, Juerd Waalboer wrote:
> Michael Ludwig wrote 2010-03-10 10:34 (+0100):
>> Okay. Let me try to see if I have understood correctly. Without the utf8
>> pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as a sequence
>> of two bytes in my source code will be stored internally as a sequence
>> of 12 integers. With the utf8 pragma in scope, only 11 integers.

I think I got confused about bytes and integers, because I misread an
earlier post by Aristoteles. What I meant is:

With the utf8 pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as
a sequence of two bytes in my source code will be stored internally as a
sequence of 11 integers. (But I shouldn't care about the integers, that's
an implementation detail.)

Without the utf8 pragma in scope, the string will be stored as a sequence
of 12 bytes, or 11 bytes if I convert the source to Latin-1.

In the broken perl versions, like 5.8.9 and 5.10.0, with the utf8 pragma in
scope I get the wrong sequence of 11 integers, as per your illustration
quoted below: I get a0 where I should get c2-a0, because those perl
versions don't handle character escapes correctly.

> "so\xa0ein\xa0Käse" must be stored as either:
>
> l1: 73 6f a0 65 69 6e a0 4b e4 73 65 (UTF8 flag off)
>
> or:
>
> u8: 73 67 c2-a0 65 69 6e c2-a0 4b c3-a4 73 65 (UTF8 flag on)

Yes (modulo typo):

so ein Käse:    73 6f c2-a0 65 69 6e c2-a0 4b c3-a4 73 65
so?ein?Käse:    73 6f c2-a0 65 69 6e c2-a0 4b c3-83 c2-a4 73 65

----
use common::sense; # includes utf8 pragma
use open OUT => qw/:encoding(UTF-8) :std/;
use Encode;

# Return the UTF-8 encoding of each character of $str as hex octets,
# joined by '-' within a character and separated by spaces.
sub show_bytes {
    my $str = shift;
    my $out = '';
    for ( split '', $str ) {
        my $octets = Encode::encode( 'UTF-8', $_ );
        $out .= join '-', map sprintf( '%x', ord ), split '', $octets;
        $out .= ' ';
    }
    return $out;
}

print STDERR "Broken in Perl 5.8.9 and 5.10.0!\n"; # fine in 5.10.1
my $sok = "so\xa0ein\xa0Käse";
print $_, ":\t", show_bytes( $_ ), "\n" for $sok;
----

> Both strings should be semantically equal, and have 11 characters, each
> of which has an integer ordinal value.
>
> What happens is the following:
>
> 73 6f a0 65 69 6e a0 4b c3-a4 73 65 (UTF8 flag on)
>       l1          l1    u8
>
> This is wrong. It is a bug.

Very graphical and palpable exposition, thanks!

-- 
Michael.Ludwig (#) XING.com
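
PS: The character and byte counts discussed above can be checked with a
small sketch that does not depend on the encoding of the source file. It is
only an illustration, not part of the test script above: the variable names
and the hex_dump helper are made up here, and every non-ASCII character is
written as an escape, so the utf8 pragma plays no role. The second string
is an assumption about what perl sees when the a-Umlaut sits in the source
as two UTF-8 bytes and the utf8 pragma is not in scope.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of every element of a string.
sub hex_dump { join ' ', map { sprintf '%02x', ord } split //, shift }

# The intended character string: 11 characters, U+00A0 twice, U+00E4 once.
my $chars = "so\x{a0}ein\x{a0}K\x{e4}se";

# What a UTF-8 source file yields without the utf8 pragma: the a-Umlaut
# arrives as the two bytes c3 a4, while each \xa0 escape stays one byte.
my $no_pragma = "so\xa0ein\xa0K\xc3\xa4se";

# Explicit encodings of the character string.
my $latin1 = encode( 'ISO-8859-1', $chars );
my $utf8   = encode( 'UTF-8',      $chars );

printf "characters        : %2d  %s\n", length $chars,     hex_dump($chars);
printf "no pragma (bytes) : %2d  %s\n", length $no_pragma, hex_dump($no_pragma);
printf "Latin-1 (bytes)   : %2d  %s\n", length $latin1,    hex_dump($latin1);
printf "UTF-8 (bytes)     : %2d  %s\n", length $utf8,      hex_dump($utf8);
```

The lengths come out as 11, 12, 11 and 14. The 14 is the size of the UTF-8
encoding in bytes; counted as characters, that same string is the 11
integers mentioned above.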