Re: Pack and Unpack are Broken for > 0x7FFF_FFFF (in 5.10.0)

Chris Hall Sun, 16 Mar 2008 11:54:25 -0700

On Sun, 16 Mar 2008 I wrote
>I thought 'C' worked on Octets ?  Which is what 5.8.8 appears to be
>doing, but not 5.10.0.


I apologise...  I should have read the perldelta.

I now understand that:

  * in v5.8.8 one would say unpack('C*', ...) and get the underlying
    octets, if the string was a 'wide' (UTF8) string.

    in v5.10.0 this is no longer possible, C in a 'wide' string
    returns the wide character value.

  * in v5.10.0, in unpack 'C' and 'W' are the same as each other.
    (At least, I cannot tell the difference.)

    I cannot imagine why.

  * in v5.10.0, in pack 'C' and 'W' are the same as each other, except
    that 'C' masks the value down to 0..255.
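
A minimal sketch of that last point, assuming perl 5.10 or later. The variable names are mine, and the `no warnings 'pack'` line is only there to silence the warning that 'C' emits when it wraps a value above 255:

```perl
use strict;
use warnings;
no warnings 'pack';   # 'C' warns when it wraps a value above 255

# 'C' keeps only the low 8 bits of each value, so the result stays a byte string.
my $c = pack('C*', 257, 359, 477, 91);

# 'W' keeps the full character value, so the result becomes a 'wide' string.
my $w = pack('W*', 257, 359, 477, 91);

# %vX prints the ordinal of every character in hex, joined by dots.
printf "C*: %vX (%s:%d)\n", $c, utf8::is_utf8($c) ? 'wide' : 'byte', length $c;
printf "W*: %vX (%s:%d)\n", $w, utf8::is_utf8($w) ? 'wide' : 'byte', length $w;
```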

Exchanging the meaning of 'U0' and 'C0' between 5.8.X and 5.10.0 was a
stroke of genius.

The pack v5.10.0 documentation says:

 "Pack and unpack can operate in two modes, character mode (C0 mode)
  where the packed string is processed per character and UTF-8 mode
  (U0 mode) where the packed string is processed in its UTF-8-encoded
  Unicode form on a byte by byte basis. Character mode is the default
  unless the format string starts with an U. You can switch mode at
  any moment with an explicit C0 or U0 in the format. A mode is in
  effect until the next mode switch or until the end of the ()-group
  in which it was entered."

Here "UTF-8 mode" appears to mean the exact opposite of what UTF8 means
when describing the state of a string value.  The given meaning for 'C',
"An unsigned char (octet) value", doesn't help clarify things :-(
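
To make the two modes concrete, here is a small sketch of how a leading 'U' versus an explicit 'C0' changes what pack returns for the same code points. It assumes perl 5.10 or later; the variable names are illustrative only:

```perl
use strict;
use warnings;

my @cp = (0x101, 0x167, 0x1DD, 0x5B);

# The template starts with 'U', so pack begins in U0 (UTF-8) mode:
# the UTF-8 bytes are joined into characters, one per code point.
my $u = pack('U*', @cp);

# An explicit C0 selects character mode first: each UTF-8 byte that
# 'U' generates ends up as a separate character of the result.
my $c0u = pack('C0U*', @cp);

printf "U*:   %vX (%d chars)\n", $u,   length $u;
printf "C0U*: %vX (%d chars)\n", $c0u, length $c0u;
```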

Anywho, v5.10.0 will:

  pack('C*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (byte:4)
  pack('W*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (byte:4)
  pack('U*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (wide:4)

     i.e. 'C' and 'W' produce 'byte' form strings, but 'U' produces
     a 'wide' character string.

     You might have thought that 'W' would generate a ('wide') character
     string...  (but then it would be identical to 'U' !)

  pack('C0C*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (byte:4)
  pack('C0W*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (byte:4)
  pack('C0U*', 192,176,128,21) -> "\xC3\x80\xC2\xB0\xC2\x80\x15"
                                                     (byte:7)

     so C0 prefix hasn't changed the result, except for C0U where we
     have been given a byte form string, containing the UTF-8.

  pack('U0C*', 192,176,128,21) -> "\x00\x00\x15" (wide:3)
  pack('U0W*', 192,176,128,21) -> "\x00\x00\x15" (wide:3)
  pack('U0U*', 192,176,128,21) -> "\xC0\xB0\x80\x15" (wide:4)

     which does create a ('wide') character string.

     This makes no difference to 'U*', but for 'C*' and 'W*', the 'byte'
     string produced has been decoded as UTF-8, replacing bad
     sequences by \x00 !

     I suppose this makes sense.

     Though one might have imagined this would mean that the result of
     C* & W* would be 'utf8::upgrade'd to 'wide' characters.

  pack('C*',   257,359,477,91) -> "\x01\x67\xDD\x5B" (byte:4)
  pack('W*',   257,359,477,91) -> "\x{101}\x{167}\x{1DD}\x5B" (wide:4)
  pack('U*',   257,359,477,91) -> "\x{101}\x{167}\x{1DD}\x5B" (wide:4)

  pack('C0C*', 257,359,477,91) -> "\x01\x67\xDD\x5B" (byte:4)
  pack('C0W*', 257,359,477,91) -> "\x{101}\x{167}\x{1DD}\x5B" (wide:4)
  pack('C0U*', 257,359,477,91) -> "\xC4\x81\xC5\xA7\xC7\x9D\x5B"
                                                              (byte:7)

     So W* and U* produce the same 'wide' character string.

     But 'C0W*' and 'C0U*' are quite different.

     I have no idea why.
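
For anyone who wants to reproduce the table above, something along these lines should do. It is only a sketch: `describe` is a hypothetical helper of mine that prints a string in the same "\x.." (byte|wide:length) notation, and I have left out the 'U0C*' and 'U0W*' templates because they produce malformed UTF-8 for these values:

```perl
use strict;
use warnings;
no warnings 'pack';   # 'C' warns when it wraps a value above 255

# Hypothetical helper: render a string as \x escapes plus (byte|wide:length).
sub describe {
    my ($s) = @_;
    my $esc = '';
    for my $ch (split //, $s) {
        my $n = ord $ch;
        $esc .= $n > 0xFF ? sprintf('\x{%X}', $n) : sprintf('\x%02X', $n);
    }
    return sprintf '"%s" (%s:%d)', $esc,
                   utf8::is_utf8($s) ? 'wide' : 'byte', length $s;
}

for my $tmpl ('C*', 'W*', 'U*', 'C0C*', 'C0W*', 'C0U*') {
    printf "pack('%s', 257,359,477,91) -> %s\n",
           $tmpl, describe(pack($tmpl, 257, 359, 477, 91));
}
```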

Almost equally amusing:

  $p = pack('W', 248) -> "\xF8" (byte:1)

  unpack('C*')   -> (0xF8)
  unpack('C0C*') -> (0xF8)
  unpack('U0C*') -> (0xC3, 0xB8)

So, 'U0C' seems to utf8::upgrade($p) before unpacking the raw bytes.

Whereas:

  $p = pack('W', 257) -> "\x{101}" (wide:1)

  unpack('C*')   -> (0x101)
  unpack('C0C*') -> (0x101)
  unpack('U0C*') -> (0xC4, 0x81)

Now, one can argue that poking around inside how characters are encoded
in wide strings is a Bad Thing.  But 'U0...' is exposing just that.
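
The two unpack cases can be checked with a short script like the following. This is a sketch assuming perl 5.10 or later, with variable names of my own choosing:

```perl
use strict;
use warnings;

my $byte = pack('W', 248);   # "\xF8", held as a single byte
my $wide = pack('W', 257);   # "\x{101}", held internally as UTF-8

# U0 switches unpack to UTF-8 mode, so 'C*' walks the UTF-8 encoding
# of the string one byte at a time.
my @from_byte = unpack('U0C*', $byte);
my @from_wide = unpack('U0C*', $wide);

printf "byte: %s\n", join ' ', map { sprintf '%02X', $_ } @from_byte;
printf "wide: %s\n", join ' ', map { sprintf '%02X', $_ } @from_wide;
```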

However, there is no straightforward way to unpack a string into raw
bytes without testing first for whether the string is 'wide' or not:

  @b = unpack(utf8::is_utf8($s) ? 'U0C*' : 'C*', $s) ;
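
Wrapped up as a function, that might look like the sketch below. `raw_bytes` is a hypothetical name of mine, not anything perl provides:

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes perl holds internally for $s,
# picking the template according to whether the string is 'wide'.
sub raw_bytes {
    my ($s) = @_;
    return unpack(utf8::is_utf8($s) ? 'U0C*' : 'C*', $s);
}

my @a = raw_bytes("\xF8");      # byte string: one byte
my @b = raw_bytes("\x{101}");   # wide string: its UTF-8 encoding

printf "%s\n", join ' ', map { sprintf '%02X', $_ } @a;
printf "%s\n", join ' ', map { sprintf '%02X', $_ } @b;
```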

Chris
-- 
Chris Hall               highwayman.com
