On Sun, 16 Mar 2008 I wrote
>I thought 'C' worked on Octets ? Which is what 5.8.8 appears to be
>doing, but not 5.10.0.

I apologise... I should have read the perldelta. I now understand that:

  * in v5.8.8 one would say unpack('C*', ...) and get the underlying
    octets, if the string was a 'wide' (UTF8) string. In v5.10.0 this
    is no longer possible: 'C' on a 'wide' string returns the wide
    character value.

  * in v5.10.0, in unpack, 'C' and 'W' are the same as each other.
    (At least, I cannot tell the difference.) I cannot imagine why.

  * in v5.10.0, in pack, 'C' and 'W' are the same as each other, except
    that 'C' masks the value down to 0..255.

Exchanging the meaning of 'U0' and 'C0' between 5.8.X and 5.10.0 was a
stroke of genius.

The pack v5.10.0 documentation says:

  "Pack and unpack can operate in two modes, character mode (C0 mode)
   where the packed string is processed per character and UTF-8 mode
   (U0 mode) where the packed string is processed in its UTF-8-encoded
   Unicode form on a byte by byte basis. Character mode is the default
   unless the format string starts with an U. You can switch mode at
   any moment with an explicit C0 or U0 in the format. A mode is in
   effect until the next mode switch or until the end of the ()-group
   in which it was entered."

Here "UTF-8 mode" appears to mean the exact opposite of UTF8 in
connection with the state of a string value. The given meaning for 'C',
"An unsigned char (octet) value", doesn't help clarify things :-(

Anyway, v5.10.0 will:

  pack('C*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (byte:4)
  pack('W*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (byte:4)
  pack('U*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (wide:4)

i.e. 'C' and 'W' produce 'byte' form strings, but 'U' produces a 'wide'
character string. You might have thought that 'W' would generate a
('wide') character string... (but then it would be identical to 'U' !)
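
For anyone wanting to reproduce the three results above, here is a minimal sketch. The form() helper and the %form hash are my own illustration, not anything from the pack documentation, and it assumes a perl that has 'W', i.e. 5.10.0 or later.

```perl
use strict;
use warnings;

# Illustrative helper: report whether a string is in 'byte' or 'wide'
# (UTF8-flagged) form, together with its length in characters.
sub form {
    my ($s) = @_;
    return (utf8::is_utf8($s) ? 'wide' : 'byte') . ':' . length($s);
}

my @values    = (192, 176, 128, 21);
my @templates = ('C*', 'W*', 'U*');

# Pack the same values with each template and record the resulting form.
my %form = map { $_ => form(pack($_, @values)) } @templates;

printf "pack('%s', ...) -> %s\n", $_, $form{$_} for @templates;
```

The string form is the only thing that differs here: compared with eq, all three results are the same four characters.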

  pack('C0C*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (byte:4)
  pack('C0W*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (byte:4)
  pack('C0U*', 192,176,128,21) -> "\xC3\x80\xC2\xB0\xC2\x80\x15" (byte:7)

so the C0 prefix hasn't changed the result, except for C0U where we have
been given a byte form string, containing the UTF-8.

  pack('U0C*', 192,176,128,21) -> "\x00\x00\x15" (wide:3)
  pack('U0W*', 192,176,128,21) -> "\x00\x00\x15" (wide:3)
  pack('U0U*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (wide:4)

which does create a ('wide') character string. This makes no difference
to 'U*', but for 'C*' and 'W*', the 'byte' string produced has been
decoded as UTF-8, replacing bad sequences by \x00 ! I suppose this makes
sense. Though one might have imagined this would mean that the result of
C* & W* would be 'utf8::upgrade'd to 'wide' characters.

  pack('C*', 257,359,477,91)   -> "\x01\x67\xDD\x5B" (byte:4)
  pack('W*', 257,359,477,91)   -> "\x{101}\x{167}\x{1DD}\x5B" (wide:4)
  pack('U*', 257,359,477,91)   -> "\x{101}\x{167}\x{1DD}\x5B" (wide:4)

  pack('C0C*', 257,359,477,91) -> "\x01\x67\xDD\x5B" (byte:4)
  pack('C0W*', 257,359,477,91) -> "\x{101}\x{167}\x{1DD}\x5B" (wide:4)
  pack('C0U*', 257,359,477,91) -> "\xC4\x81\xC5\xA7\xC7\x9D\x5B" (byte:7)

So W* and U* produce the same 'wide' character string. But 'C0W*' and
'C0U*' are quite different. I have no idea why.

Almost equally amusing:

  $p = pack('W', 248)  -> "\xF8" (byte:1)
  unpack('C*', $p)     -> (0xF8)
  unpack('C0C*', $p)   -> (0xF8)
  unpack('U0C*', $p)   -> (0xC3, 0xB8)

So, 'U0C' seems to utf8::upgrade($p) before unpacking the raw bytes.
Whereas:

  $p = pack('W', 257)  -> "\x{101}" (wide:1)
  unpack('C*', $p)     -> (0x101)
  unpack('C0C*', $p)   -> (0x101)
  unpack('U0C*', $p)   -> (0xC4, 0x81)

Now, one can argue that poking around inside how characters are encoded
in wide strings is a Bad Thing. But 'U0...' is exposing just that.
However, there is no straightforward way to unpack a string into raw
bytes without testing first for whether the string is 'wide' or not:

  @b = unpack(utf8::is_utf8($s) ? 'U0C*' : 'C*', $s) ;

Chris
--
Chris Hall               highwayman.com
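
PS: the conditional unpack above can be wrapped in a sub, roughly as follows. This is only a sketch: raw_octets() is a name I have made up for illustration, and it assumes 5.10.0 or later semantics for 'U0' and 'W'.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): return the octets of the
# string as stored, choosing the template according to whether the string
# is 'wide' (UTF8-flagged) or in 'byte' form.
sub raw_octets {
    my ($s) = @_;
    return unpack(utf8::is_utf8($s) ? 'U0C*' : 'C*', $s);
}

my @byte = raw_octets(pack('W', 248));   # 'byte' form string
my @wide = raw_octets(pack('W', 257));   # 'wide' string

print join(' ', map { sprintf '0x%02X', $_ } @byte), "\n";
print join(' ', map { sprintf '0x%02X', $_ } @wide), "\n";
```

Note that the 'byte' form string is returned as it stands, not upgraded first, so the same character value can yield different octets depending on how the string happens to be stored.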