Brian Foster wrote on 2004-01-14 19:31 UTC:
> | Continuing characters always begin with binary "10". There is no chance
> | for an illegal 5 byte sequence to be mistaken for an illegal 4byte
> | sequence followed by an ascii character.
>
> yes there is. if the illegal 5-byter has the first
> 4-bytes legal followed by an US-ASCII byte (which is
> what makes the 5-byter illegal), a parser that never
> considers sequences longer than 4-bytes will see an
> illegal sequence of 4-bytes and then a valid byte.

No, there is not. A malformed UTF-8 sequence can *never* contain an
ASCII byte, because an ASCII byte always terminates any malformed
sequence that precedes it. Every ASCII byte must resynchronize the
decoder and is then interpreted correctly as an ASCII character. If
your UTF-8 decoder does not resynchronize correctly, you may be in
serious security trouble.

You have demonstrated a quite common (and, from a security point of
view, very dangerous) misunderstanding of how a UTF-8 decoder is
supposed to work. If you have ever written a UTF-8 decoder, please
test it thoroughly with

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

which contains all the boundary cases where one might make a mistake
when implementing a UTF-8 decoder.

Markus

--
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
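
P.S. For illustration, here is a minimal Python sketch of the
resynchronization rule described above. It is not taken from any
particular decoder: the function name decode_utf8_resync and the
policy of emitting one U+FFFD per malformed sequence are assumptions
made for this example. The point it tries to show is only that a
malformed sequence ends *before* the first byte that is not a
continuation byte, so an ASCII byte is never swallowed.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_utf8_resync(data: bytes) -> str:
    """Decode UTF-8, replacing malformed sequences with U+FFFD.

    Hypothetical helper for illustration only. An ASCII byte is always
    decoded as itself, whatever malformed bytes precede it.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII: always a complete character on its own.
            out.append(chr(b))
            i += 1
            continue
        # Lead byte decides how many continuation bytes are needed and
        # the smallest code point that may use this length (to reject
        # overlong forms).
        if 0xC2 <= b <= 0xDF:
            need, cp, lowest = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, lowest = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, lowest = 3, b & 0x07, 0x10000
        else:
            # Stray continuation byte, C0/C1, or F5..FF (this includes
            # the lead bytes of the old 5- and 6-byte forms).
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume at most `need` continuation bytes (10xxxxxx). Stop at
        # the first byte that is not one, and do NOT consume that byte.
        j = i + 1
        while j < n and j - i <= need and (data[j] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j - i - 1) == need
        is_surrogate = 0xD800 <= cp <= 0xDFFF
        if complete and lowest <= cp <= 0x10FFFF and not is_surrogate:
            out.append(chr(cp))
        else:
            out.append(REPLACEMENT)
        i = j  # resynchronize at the first unconsumed byte
    return "".join(out)
```

With this sketch, the bytes F8 88 80 80 41 (an old-style 5-byte lead
byte, three continuation bytes, then "A") decode to four U+FFFD
followed by "A", and the truncated sequence F0 9F 98 followed by "A"
decodes to one U+FFFD followed by "A". Other decoders may emit a
different number of replacement characters; what matters is that the
"A" survives in both cases.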
