On Sat, Jan 17, 2004 at 02:27:43PM -0500, Edward H Trager wrote:
>
> On Sat, 17 Jan 2004, Markus Kuhn wrote:
>
> > [EMAIL PROTECTED] wrote on 2004-01-11 16:53 UTC:
> > > > That is not necessarily good advice in security issues.
> > >
> > > What harm can it be? It will not be characters that are relevant in any
> > > syntactical analyses.
> > >
> > > Consider: parser 1 knows that a UTF-8 sequence can have
> > > at most 6 bytes, and sees an illegal 5-byte sequence.
> > >
> > > Parser 2 knows that a UTF-8 sequence can have at most
> > > 4 bytes, and sees an illegal 4-byte sequence followed by
> > > an ASCII symbol.
> > >
> > > Difference in interpretation of a byte sequence always has
> > > security implications.
> >
> > Your example does not work, because an ASCII byte must always
> > resynchronize the decoder and be recognized as an ASCII character,
> > completely independent of whether the decoder knows about the existence
> > of 6-byte UTF-8 sequences or treats all bytes in the range 0xf8..0xff as
> > illegal. Bytes in the range 0x00..0x7f cannot be part of a malformed
> > UTF-8 sequence.
> >
> > I have yet to see a scenario where the difference between 4-byte and
> > 6-byte UTF-8 decoder could lead to a plausible security risk and I don't
> > believe that one is easy to construct or likely to happen.
>
> Hi, Markus,
>
> Then I assume you would advocate that UTF-8 encoders/decoders (for example
> for Linux) be written to handle all 6 bytes, not just the four which is
> probably the case now?
Well, many UTF-8 encoders/decoders probably already handle sequences of
up to 6 bytes, since the 4-byte restriction was defined for Unicode UTF-8
only recently. ISO 10646 UTF-8 has allowed sequences of up to 6 bytes
since its introduction about 10 years ago.

Best regards
Keld
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
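
P.S. For readers following the resynchronization argument quoted above, here is a minimal sketch of it in Python. It is not taken from any real decoder: the names `sequence_length` and `scan` and the `max_len` parameter are made up for illustration. It only looks at lead and continuation bytes, and it ignores overlong forms, surrogates and the U+10FFFF limit, all of which a real decoder would also have to reject.

```python
def sequence_length(lead: int, max_len: int) -> int:
    """Length announced by a lead byte, or 0 if the byte cannot start a
    sequence. max_len is 4 (restricted form) or 6 (original 6-byte form)."""
    if lead < 0x80:
        return 1                       # ASCII, always a complete character
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if max_len == 6 and 0xF8 <= lead <= 0xFB:
        return 5
    if max_len == 6 and 0xFC <= lead <= 0xFD:
        return 6
    return 0                           # stray continuation byte or disallowed lead


def scan(data: bytes, max_len: int) -> list[tuple[str, bytes]]:
    """Split data into ("ascii" | "seq" | "bad", raw bytes) items."""
    out = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i], max_len)
        if n == 1:
            out.append(("ascii", data[i:i + 1]))
            i += 1
            continue
        # Consume continuation bytes (0x80..0xBF) only; any other byte,
        # in particular an ASCII byte, ends the current item.
        j = i + 1
        while n and j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        kind = "seq" if n and j == i + n else "bad"
        out.append((kind, data[i:j]))
        i = j
    return out


sample = b"\xf8\x88\x80\x80\x80A"      # 5-byte form followed by ASCII "A"
print(scan(sample, 6))
print(scan(sample, 4))
```

With `max_len=6` the first five bytes come out as one sequence; with `max_len=4` they come out as five separate bad bytes. In both cases the trailing "A" is reported as an ASCII character, because a byte below 0x80 is never consumed as part of a multi-byte sequence.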
