Re: Linux console UTF-8 by default

srintuar Tue, 13 Jan 2004 21:21:11 -0800

> Consider: parser 1 knows that a UTF-8 sequence can have
> at most 6 bytes, and sees an illegal 5-byte sequence.
>
> Parser 2 knows that a UTF-8 sequence can have at most
> 4 bytes, and sees an illegal 4-byte sequence followed by
> an ASCII symbol.
>
> Difference in interpretation of a byte sequence always has security implications.

Continuation bytes always begin with binary "10", while an ASCII byte always begins with binary "0". There is therefore no chance for an illegal 5-byte sequence to be mistaken for an illegal 4-byte sequence followed by an ASCII character.
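
To make the bit patterns concrete, here is a minimal sketch that classifies a single byte by its leading bits, using the original six-byte definition of UTF-8. The function name classify and its return values are made up for illustration.

```python
def classify(byte):
    """Classify one byte (0..255) by its leading bits.

    Returns "ascii", "continuation", the sequence length announced by a
    lead byte (2..6), or "invalid" for 0xFE and 0xFF.
    Hypothetical helper, written for illustration only.
    """
    if byte < 0x80:
        return "ascii"         # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx
    if byte < 0xE0:
        return 2               # 110xxxxx
    if byte < 0xF0:
        return 3               # 1110xxxx
    if byte < 0xF8:
        return 4               # 11110xxx
    if byte < 0xFC:
        return 5               # 111110xx
    if byte < 0xFE:
        return 6               # 1111110x
    return "invalid"           # 0xFE and 0xFF never appear in UTF-8
```

The three classes (ASCII, continuation, lead) occupy disjoint byte ranges, so the fifth byte of a 5-byte sequence, being a continuation byte, can never be read as ASCII.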

A sequence can be illegal by virtue of being overcoded (overlong), or, in the case of pedantic adherence to the 0x10FFFF ceiling, by lying beyond the reach of UTF-16, but there is no chance that the beginning of a character sequence will be misinterpreted. (Such a pedantic parser might see an invalid lead byte followed by 4 unattached continuation bytes.)
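
As a rough sketch of the two parsers in question, the function below splits a byte string into units, given the longest sequence the parser is willing to accept. The names lead_length and segment and the labels "seq", "bad_lead" and "stray" are invented for this example; no overlong or range checking is done here, only segmentation.

```python
def lead_length(byte):
    """Sequence length announced by a lead byte, or 0 if it is not one."""
    if 0xC0 <= byte < 0xE0:
        return 2
    if 0xE0 <= byte < 0xF0:
        return 3
    if 0xF0 <= byte < 0xF8:
        return 4
    if 0xF8 <= byte < 0xFC:
        return 5
    if 0xFC <= byte < 0xFE:
        return 6
    return 0


def segment(data, max_len):
    """Split bytes into (kind, chunk) units for a parser that accepts
    sequences of at most max_len bytes (hypothetical illustration)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(("ascii", data[i:i + 1]))
            i += 1
            continue
        if b < 0xC0:
            # A continuation byte with no lead byte in front of it.
            out.append(("stray", data[i:i + 1]))
            i += 1
            continue
        n = lead_length(b)
        tail = data[i + 1:i + n]
        complete = (
            0 < n <= max_len
            and len(tail) == n - 1
            and all(0x80 <= t < 0xC0 for t in tail)
        )
        if complete:
            out.append(("seq", data[i:i + n]))
            i += n
        else:
            out.append(("bad_lead", data[i:i + 1]))
            i += 1
    return out


# A 5-byte sequence (encoding 0x200000) followed by ASCII "A".
data = b"\xf8\x88\x80\x80\x80A"
six_byte_view = segment(data, 6)
four_byte_view = segment(data, 4)
```

Under these assumptions the six-byte parser reports one 5-byte sequence and then "A", while the four-byte parser reports a bad lead byte, four stray continuation bytes, and then "A". Both agree on where the ASCII character is.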

I personally don't think that UTF-8 parsers should bother to enforce the limit, any more than UCS-4 "parsers" should scan all strings for high words. They should deal with any valid UTF-8 sequence up to six bytes long.
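
A minimal sketch of what such a lenient decoder could look like for one complete sequence: it accepts up to six bytes, rejects overcoded forms, and deliberately applies no 0x10FFFF ceiling. The names decode_one and MIN_FOR_LENGTH are hypothetical.

```python
# Smallest codepoint that genuinely needs a sequence of each length;
# anything smaller encoded at that length is overcoded (overlong).
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def decode_one(seq):
    """Decode one complete UTF-8 sequence of 1..6 bytes to a codepoint.

    Raises ValueError on a malformed or overlong sequence.
    No upper limit on the codepoint is enforced here.
    """
    n = len(seq)
    if not 1 <= n <= 6:
        raise ValueError("sequence must be 1 to 6 bytes long")
    first = seq[0]
    if n == 1:
        if first >= 0x80:
            raise ValueError("not a single-byte character")
        return first
    # A lead byte for length n is n one-bits followed by a zero bit.
    prefix = (0xFF << (8 - n)) & 0xFF
    prefix_mask = (0xFF << (7 - n)) & 0xFF
    if first & prefix_mask != prefix:
        raise ValueError("lead byte does not match sequence length")
    cp = first & (0xFF >> (n + 1))
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < MIN_FOR_LENGTH[n]:
        raise ValueError("overlong encoding")
    return cp
```

For example, decode_one(b"\xf8\x88\x80\x80\x80") yields 0x200000, which is above the UTF-16 range but is passed through untouched, while the overcoded b"\xc0\xaf" is refused.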

If someone wants to make a pass over a Unicode string to limit and validate the ranges of the codepoints used, that should be a separate consideration. That is the time at which to consider whether all or part of the text should be stricken, ignored, refused, cleaned up, or otherwise handled.
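
Such a separate pass could be as simple as the sketch below, which works on already-decoded codepoints. The names out_of_range and replace_out_of_range are invented here, and replacing with U+FFFD is only one of the possible policies, chosen as an assumption for the example.

```python
UNICODE_CEILING = 0x10FFFF  # highest codepoint reachable from UTF-16


def out_of_range(codepoints, ceiling=UNICODE_CEILING):
    """Report (index, codepoint) pairs above the ceiling.

    This only finds the offenders; what to do about them is left
    to the caller.
    """
    return [(i, cp) for i, cp in enumerate(codepoints) if cp > ceiling]


def replace_out_of_range(codepoints, ceiling=UNICODE_CEILING,
                         replacement=0xFFFD):
    """One possible policy: substitute a replacement codepoint."""
    return [cp if cp <= ceiling else replacement for cp in codepoints]


text = [0x48, 0x69, 0x200000, 0x21]
offenders = out_of_range(text)
cleaned = replace_out_of_range(text)
```

Keeping the range check apart from the decoder means a caller that would rather refuse or strip the text can reuse out_of_range and apply its own handling.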


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
