1. 'Ill-formed' UTF-8
=====================

The Unicode Standard specifies that any UTF-8 sequence that does not
correspond to this table is 'ill-formed':
  Code Points        | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
  -------------------+----------+----------+----------+----------+
  U+0000..U+007F     | 00..7F   | --       | --       | --       |
  U+0080..U+07FF     | C2..DF   | 80..BF   | --       | --       |
  U+0800..U+0FFF     | E0       | A0..BF   | 80..BF   | --       |
  U+1000..U+CFFF     | E1..EC   | 80..BF   | 80..BF   | --       |
  U+D000..U+D7FF     | ED       | 80..9F   | 80..BF   | --       |
  U+E000..U+FFFF     | EE..EF   | 80..BF   | 80..BF   | --       |
  U+10000..U+3FFFF   | F0       | 90..BF   | 80..BF   | 80..BF   |
  U+40000..U+FFFFF   | F1..F3   | 80..BF   | 80..BF   | 80..BF   |
  U+100000..U+10FFFF | F4       | 80..8F   | 80..BF   | 80..BF   |

Note in particular that:

  - anything beyond U+10FFFF is ill-formed.
  - anything in U+D800..U+DFFF is ill-formed.
  - only one encoding for each code point is well-formed.

We would expect UTF-8 decode to spot ill-formed sequences, though some
special handling of incomplete sequences at the end of a buffer would
be handy.  We would expect UTF-8 encode to generate only well-formed
sequences.

2. Extended Sequences
=====================

Unicode and ISO/IEC 10646:2003 define meanings for UTF-8-compatible
sequences of up to 6 bytes, which allows for characters up to
0x7FFF_FFFF.  The Unicode reference code for reading UTF-8 recognises
these extended sequences as single entities (though ill-formed).

Perl has its own further 7- and 13-byte forms, allowing for characters
up to 0xF_FFFF_FFFF and 2^72-1, respectively.  These are beyond UTF-8.

3. Non-Characters
=================

The only other cause for concern is non-characters.  These are:

  * U+FFFE and U+FFFF, and the last two code points in every other
    Unicode plane.  Unicode code space is divided into 17 'planes' of
    65,536 code points each, so U+01_FFFE, U+01_FFFF, U+02_FFFE,
    U+02_FFFF, ... U+10_FFFE and U+10_FFFF are all non-characters.

  * U+FDD0..U+FDEF

Now, Unicode 5.0.0 says:

  "Applications are free to use any of these noncharacter code points
  internally but should never attempt to exchange them.
  If a noncharacter is received in open interchange, an application is
  not required to interpret it in any way.  It is good practice,
  however, to recognize it as a noncharacter and to take appropriate
  action, such as removing it from the text."

  "Noncharacter code points are reserved for internal use, such as for
  sentinel values.  They should never be interchanged.  They do,
  however, have well-formed representations in Unicode encoding forms
  and survive conversions between encoding forms.  This allows sentinel
  values to be preserved internally across Unicode encoding forms, even
  though they are not designed to be used in open interchange."

So... this is not so clear-cut.  For "open interchange" UTF-8 should
disallow the non-characters.  However, for local storage of Unicode
stuff, non-characters should be allowed.

4. What 'UTF-8' Does
====================

Ill-formed sequences -- fine (mostly):

  * UTF-8 decode treats these as errors, and will stop or use fallback
    decoding as required.

    The default fallback is:

      - errors for sequences <= 0x7FFF_FFFF -- replaced by U+FFFD

        *** information is being lost, here :-(

      - anything else: each byte which is not recognised as part of a
        complete 2..6 byte sequence is replaced by U+FFFD

        *** so one cannot distinguish ill-formed sequences from
            out-of-range characters.

    The PERLQQ, HTMLCREF and XMLCREF fallbacks are:

      - errors for sequences <= 0x7FFF_FFFF -- replaced by the
        respective escape sequence for the character value.  This
        ought to work if the data is HTML or XML, where new escape
        sequences fit right in, provided HTMLCREF or XMLCREF is used.

        *** PERLQQ, however, may fail if '\' appears in the input and
            the sender has not escaped it !  Perhaps PERLQQ should
            escape any '\' that appears in the input ?

        *** In all cases, however, all that has been achieved is that
            non-UTF-8 characters have been transliterated.  It's still
            a puzzle what may be done with these characters !
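For comparison, the same lossy behaviour can be seen in Python's UTF-8
codec, whose 'replace' error handler corresponds to the default
fallback described above.  A sketch for illustration only -- this is
Python, not Encode itself:

```python
# Illustration (Python, not Perl's Encode): the 'replace' error
# handler substitutes U+FFFD, losing the original bytes.
surrogate = b'\xed\xa0\x80'   # would encode U+D800 -- ill-formed
truncated = b'\xf0\x9f'       # first two bytes of a 4-byte sequence

# Strict decoding treats both as errors...
for octets in (surrogate, truncated):
    try:
        octets.decode('utf-8')
        raise AssertionError('should not decode')
    except UnicodeDecodeError:
        pass

# ...while the replacement fallback yields only U+FFFD, so two quite
# different errors become indistinguishable after decoding.
a = surrogate.decode('utf-8', errors='replace')
b = truncated.decode('utf-8', errors='replace')
assert set(a) == {'\ufffd'} and set(b) == {'\ufffd'}
```

Which is exactly the complaint: once the fallback has run, the caller
can no longer tell what was wrong, or recover the original octets.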
      - anything else: each byte which is not recognised as part of a
        complete up-to-6-byte sequence is replaced by the respective
        escape sequence for the byte value.

        *** this is impossible to distinguish from escaped values
            which could exist in the input !

  * UTF-8 encode will not generate ill-formed sequences, and treats
    out-of-range character values as errors.  Errors will stop
    encoding or cause the fallback encoding to be used.

    The default fallback is:

      - errored characters <= 0x7FFF_FFFF -- replaced by U+FFFD

        *** Not much one can do here.  It's not clear that U+FFFD is
            a good thing to output -- one could argue for discarding
            this rubbish, instead ?

      - 0x8000_0000 and greater -- replaced by seven or thirteen
        U+FFFD, depending on the length of the Perl internal form !!!

        *** This is also more than a bit odd !!

    The PERLQQ, HTMLCREF and XMLCREF fallbacks are:

      - errored characters <= 0x7FFF_FFFF -- replaced by the
        respective escape sequence for the character value.  This
        ought to work if the data is HTML or XML, where new escape
        sequences fit right in, provided HTMLCREF or XMLCREF is used.

        *** PERLQQ, however, may fail if '\' appears in the output and
            the sender has not escaped it !  Perhaps PERLQQ should
            escape any '\' that appears in the output ?

        *** In all cases, however, all that has been achieved is that
            non-UTF-8 characters have been transliterated.  It's still
            a puzzle what may be done with these characters !

      - 0x8000_0000 and greater -- replaced by the seven or thirteen
        bytes that comprise the Perl internal form, each as its
        respective escape sequence !!!

        *** This is also more than a bit odd !!

Incomplete sequences -- fine, but not documented !

  * UTF-8 decode generally treats these as ill-formed, as above.
    However, the STOP_AT_PARTIAL CHECK option will cause decode to
    stop, without error (so without invoking the fallback).

Non-Character Values -- inconsistent and arguable !!
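For concreteness, here is how a codec that treats non-characters as
ordinary scalar values behaves.  Python's UTF-8 codec round-trips
them while still rejecting surrogates -- an illustration only, not a
claim about Encode:

```python
# Illustration (Python, not Perl's Encode): noncharacters have
# well-formed UTF-8 representations and survive a round trip.
for ch in ('\ufffe', '\uffff', '\ufdd0', '\U0010ffff'):
    octets = ch.encode('utf-8')
    assert octets.decode('utf-8') == ch   # round-trips cleanly

assert '\uffff'.encode('utf-8') == b'\xef\xbf\xbf'

# A surrogate, by contrast, is rejected outright:
try:
    '\ud800'.encode('utf-8')
    raise AssertionError('should not encode')
except UnicodeEncodeError:
    pass
```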
As noted above, one can argue for two approaches here, depending on
whether the data being en/decoded is internal or external.  For
internal data, non-characters are valid and should be preserved.  For
external data, non-characters should not be sent or received; one can
debate whether they should be dropped, replaced or escaped.

UTF-8 encode/decode recognise only U+FFFF as a non-character, and
treat it as an error.

  *** This looks like a bug.  If non-character values are to be
      treated as errors, I suggest all non-character values should be
      so treated.

  *** This caters only for external data exchange.

The error handling is as for ill-formed sequences, see above.

5. Conclusion: 'UTF-8' is broken
================================

  * the non-character handling is incomplete.

  * it can be argued that there should be an option to accept/allow
    non-character values.

  * the various fallback options are all less than satisfactory in
    their own way.  One can see why the coderef CHECK argument was
    invented.

    HOWEVER: it would be handy if a second parameter were passed to
    the CHECK subroutine, telling it *why* the given sequence cannot
    be encoded/decoded, in particular:

      -- out-of-range character value
      -- ill-formed sequence (and could pass in everything up to the
         next not-invalid byte ?)
      -- non-character
      -- incomplete sequence

    for otherwise the subroutine has to do all the work to figure
    this out for itself !

------------------------------------------------------------------------

It is clear that what data is valid, and how to deal with invalid
data, is really up to the application.  Trying to be helpful in
Encode/Decode is apparently tricky.

It is also clear that a lot of heavy-duty character/byte bashing
would be better if it could be provided in XS land.  However,
thinking about some simple but general mechanism for this is making
my head hurt.

[I'm going to go away now, and lie down.]

--
Chris Hall               highwayman.com