Continuation bytes always begin with binary "10". There is no chance of an illegal 5-byte sequence being mistaken for an illegal 4-byte sequence followed by an ASCII character.

Consider: parser 1 knows that a UTF-8 sequence can have at most 6 bytes, and sees an illegal 5-byte sequence.
Parser 2 knows that a UTF-8 sequence can have at most 4 bytes, and sees an illegal 4-byte sequence followed by an ASCII symbol.
A difference in interpretation of a byte sequence always has security implications.
A sequence can be illegal by virtue of being overcoded (encoded in more bytes than necessary) and, under pedantic adherence to the 0x10FFFF ceiling, by lying beyond the limits of UTF-16, but there is no chance that the beginning of a character sequence will be misinterpreted. (Such a pedantic parser might see an invalid sequence-start byte, followed by 4 unattached continuation bytes.)
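To make the point concrete, here is a rough Python sketch of how two parsers with different length limits would carve up the same bytes. The names `sequence_length` and `segment`, and the unit labels, are made up for illustration; this is not taken from any real parser. It assumes the original 1-to-6-byte layout, where the lead byte alone announces the sequence length.

```python
def sequence_length(byte):
    """Length announced by a byte: 1 for ASCII, 0 for a continuation
    byte (10xxxxxx), 2..6 for a lead byte, -1 for 0xFE/0xFF."""
    if byte < 0x80:
        return 1
    if byte < 0xC0:
        return 0
    if byte < 0xE0:
        return 2
    if byte < 0xF0:
        return 3
    if byte < 0xF8:
        return 4
    if byte < 0xFC:
        return 5
    if byte < 0xFE:
        return 6
    return -1


def segment(data, max_len):
    """Split data into (kind, offset, length) units for a parser that
    accepts sequences of at most max_len bytes (hypothetical helper)."""
    units = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 1:
            units.append(("ascii", i, 1))
            i += 1
        elif n == 0:
            # A continuation byte with no lead byte in front of it.
            units.append(("stray-continuation", i, 1))
            i += 1
        elif n < 0 or n > max_len:
            # A lead byte this parser refuses; it is rejected on its own.
            units.append(("bad-lead", i, 1))
            i += 1
        else:
            # Consume only bytes that really are continuation bytes.
            j = i + 1
            while j < len(data) and j < i + n and sequence_length(data[j]) == 0:
                j += 1
            kind = "sequence" if j - i == n else "truncated"
            units.append((kind, i, j - i))
            i = j
    return units


# A 5-byte sequence (U+200000 in the old scheme) followed by "A".
data = bytes([0xF8, 0x88, 0x80, 0x80, 0x80, 0x41])
six_byte_view = segment(data, 6)
four_byte_view = segment(data, 4)
```

With this input, the 6-byte parser reports one 5-byte sequence and then the ASCII byte at offset 5. The 4-byte parser reports a bad lead byte, four stray continuation bytes, and then the same ASCII byte at the same offset. Neither one ever treats a continuation byte as the start of a character.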
I personally don't think that UTF-8 parsers should bother to enforce the limit, any more than UCS-4 "parsers" should scan every string for high words; they should deal with any valid UTF-8 sequence up to six bytes long.
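For what it is worth, a decoder along these lines might look roughly like the following sketch. The function name `decode_utf8_6` is hypothetical, and treating overcoded (non-shortest) forms as errors is my assumption about what "valid" means here. It deliberately does not enforce the 0x10FFFF ceiling.

```python
# Smallest code point that legitimately needs each sequence length.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def decode_utf8_6(data):
    """Decode bytes into a list of integer code points using the
    original 1-to-6-byte UTF-8 layout. Raises ValueError on malformed
    or overcoded input; applies no upper limit to the code points."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 1, b
        elif b < 0xC0:
            raise ValueError("unexpected continuation byte at offset %d" % i)
        elif b < 0xE0:
            n, cp = 2, b & 0x1F
        elif b < 0xF0:
            n, cp = 3, b & 0x0F
        elif b < 0xF8:
            n, cp = 4, b & 0x07
        elif b < 0xFC:
            n, cp = 5, b & 0x03
        elif b < 0xFE:
            n, cp = 6, b & 0x01
        else:
            raise ValueError("invalid byte 0x%02X at offset %d" % (b, i))

        tail = data[i + 1:i + n]
        # Every trailing byte must look like 10xxxxxx.
        if len(tail) != n - 1 or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError("truncated sequence at offset %d" % i)
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)

        if cp < MIN_FOR_LENGTH[n]:
            raise ValueError("overcoded sequence at offset %d" % i)
        out.append(cp)
        i += n
    return out
```

The result is a plain list of integers, so whatever consumes it is free to apply its own range policy afterwards.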
If someone wants to make a pass over a Unicode string to limit and validate the ranges of the codepoints used, that should be a separate consideration. That is the time to consider whether all or part of the text should be stricken, ignored, refused, cleaned up, or otherwise handled.
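As a sketch of what such a separate pass could look like, working on already-decoded code points: the names `check_codepoints` and `apply_policy`, the policy strings, and the choice to flag surrogates are all my own assumptions for illustration, not anything prescribed by a standard API.

```python
def check_codepoints(codepoints, ceiling=0x10FFFF, allow_surrogates=False):
    """Return a list of (index, codepoint, reason) for every code point
    that falls outside the caller's chosen range."""
    problems = []
    for index, cp in enumerate(codepoints):
        if cp > ceiling:
            problems.append((index, cp, "above ceiling"))
        elif not allow_surrogates and 0xD800 <= cp <= 0xDFFF:
            problems.append((index, cp, "surrogate"))
    return problems


def apply_policy(codepoints, policy):
    """Handle out-of-range code points according to policy:
    'refuse' rejects the whole text, 'strike' drops the offenders,
    'replace' substitutes U+FFFD, 'ignore' passes everything through."""
    bad = {index for index, _, _ in check_codepoints(codepoints)}
    if policy == "ignore" or not bad:
        return list(codepoints)
    if policy == "refuse":
        raise ValueError("%d code point(s) out of range" % len(bad))
    if policy == "strike":
        return [cp for i, cp in enumerate(codepoints) if i not in bad]
    if policy == "replace":
        return [0xFFFD if i in bad else cp for i, cp in enumerate(codepoints)]
    raise ValueError("unknown policy: %r" % policy)
```

The decoder never needs to know which policy is in force; the decision is made once, here, by whoever cares about the ceiling.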
-- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
