pugs-comm...@feather.perl6.nl writes:

> +The C<utf8> type is derived from C<buf8>, with the additional constraint
> +that it may only contain validly encoded UTF-8. Likewise, C<utf16> is
> +derived from C<buf16>, and C<utf32> from C<buf32>.
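For concreteness, here is how one existing strict decoder, CPython's, answers some of the questions below. This is only illustrative of current practice, not a claim about what the Perl 6 C<utf8> type should do; the `accepts` helper is a hypothetical name:

```python
def accepts(data: bytes) -> bool:
    """Return True if CPython's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# Overlong two-byte encoding of U+0041: rejected.
assert not accepts(b'\xc1\x81')
# Encoded UTF-16 surrogate U+D800: rejected.
assert not accepts(b'\xed\xa0\x80')
# Noncharacter U+FFFF: accepted.
assert accepts(b'\xef\xbf\xbf')
# Would-be codepoint 0x11_0000, just past the Unicode ceiling: rejected.
assert not accepts(b'\xf4\x90\x80\x80')
```

So at least this decoder rejects overlong forms, surrogates, and codepoints above 0x10_FFFF, while permitting noncharacters.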
What does "validly encoded UTF-8" mean in this context? The following
questions come to mind:

1. Four-byte UTF-8 sequences are enough to handle any Unicode character.
   Are the obvious five- and six-byte extensions permitted? If so, how
   about a seven-byte extension (needed to allow any 32-bit value to be
   encoded)? Whichever sequence length is chosen, is there an additional
   constraint on the maximum permitted codepoint? For example, four-byte
   UTF-8 sequences can easily represent values up to 0x1f_ffff, but
   Unicode stops at 0x10_ffff. Or, if seven-byte sequences are permitted,
   are codepoints limited to 2**32-1?

2. Are overlong encoded sequences (0xC1 0x81 for U+0041, and so on)
   permitted? (I hope not.)

3. Are encoded codepoints corresponding to UTF-16 surrogates permitted?

4. Are noncharacter codepoints (0xFFFE, 0xFFFF, etc.) permitted?

5. Are unallocated codepoints permitted? If so, that doesn't seem very
   "valid"; but if not, a program's behaviour might change under a newer
   version of Unicode. Perhaps programs should be given the opportunity
   to declare which Unicode version's list of allocated characters they
   want.

6. Are values that begin with combining characters permitted?

Of those, question (3) applies to UTF-32, and questions (4), (5), and (6)
to both UTF-16 and UTF-32. Further, a variant of (1) applies to UTF-32:
are code units greater than 0x10FFFF permitted? I assume that the
C<utf16> type forbids invalid surrogate sequences.

I'm also tempted to suggest that the type names should be C<utf-8>,
C<utf-16>, C<utf-32>.

-- 
Aaron Crane ** http://aaroncrane.co.uk/