[EMAIL PROTECTED] wrote on 2004-01-11 16:53 UTC:
> > That is not necessarily good advice in security issues.
>
> What harm can it be? It will not be characters that are relevant in any
> syntactical analyses.
>
> Consider: parser 1 knows that a UTF-8 sequence can have
> at most 6 bytes, and sees an illegal 5-byte sequence.
>
> Parser 2 knows that a UTF-8 sequence can have at most
> 4 bytes, and sees an illegal 4-byte sequence followed by
> an ASCII symbol.
>
> Difference in interpretation of a byte sequence always has
> security implications.

Your example does not work, because an ASCII byte must always
resynchronize the decoder and be recognized as an ASCII character,
completely independent of whether the decoder knows about the existence
of 6-byte UTF-8 sequences or treats all bytes in the range 0xf8..0xff
as illegal. Bytes in the range 0x00..0x7f cannot be part of a malformed
UTF-8 sequence.

I have yet to see a scenario where the difference between a 4-byte and
a 6-byte UTF-8 decoder could lead to a plausible security risk, and I
don't believe that one is easy to construct or likely to happen.

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
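
The resynchronization argument above can be illustrated with a small
sketch. It is not taken from any real library: `decode`, `expected_length`
and the `max_len` parameter are hypothetical names chosen for this example.
The same toy decoder is run once as a "6-byte" parser and once as a "4-byte"
parser. Both emit every byte in 0x00..0x7f as an ASCII character, so they
may disagree on how many replacement characters a malformed sequence
produces, but not on which ASCII characters follow it.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs a sequence of each length;
# anything below it is an overlong encoding and is rejected.
MIN_CP = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}

# (mask, pattern) pairs for lead bytes announcing 2..6 byte sequences.
LEAD_PATTERNS = [(0xE0, 0xC0), (0xF0, 0xE0), (0xF8, 0xF0), (0xFC, 0xF8), (0xFE, 0xFC)]


def expected_length(lead: int, max_len: int) -> int:
    """Sequence length announced by a non-ASCII lead byte, or 0 if illegal."""
    for length, (mask, pattern) in enumerate(LEAD_PATTERNS, start=2):
        if (lead & mask) == pattern:
            return length if length <= max_len else 0
    return 0  # stray continuation byte, or 0xfe/0xff


def decode(data: bytes, max_len: int) -> str:
    """Toy decoder (hypothetical); max_len is 4 or 6 in this example."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # An ASCII byte is always emitted as itself and never
            # consumed as part of a multi-byte sequence.
            out.append(chr(b))
            i += 1
            continue
        n = expected_length(b, max_len)
        tail = data[i + 1 : i + n]
        if n >= 2 and len(tail) == n - 1 and all((t & 0xC0) == 0x80 for t in tail):
            cp = b & (0x7F >> n)  # payload bits of the lead byte
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            ok = MIN_CP[n] <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF
            out.append(chr(cp) if ok else REPLACEMENT)
            i += n
        else:
            # Malformed: replace this one byte and rescan from the next,
            # so a following ASCII byte is seen on its own.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)


def ascii_part(text: str) -> str:
    """Keep only the ASCII characters of a decoded string."""
    return "".join(c for c in text if ord(c) < 0x80)


# An illegal 5-byte sequence between ASCII characters.
sample = b"a\xf8\x88\x80\x80\x80/b"
print(ascii_part(decode(sample, 6)) == ascii_part(decode(sample, 4)))
```

Under these assumptions the two parsers differ only in the number of
U+FFFD characters they emit for the malformed bytes. A decoder that
instead let a lead byte swallow a following ASCII byte would break this
property, which is the behaviour the argument above rules out.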
