Re: Linux console UTF-8 by default

Markus Kuhn Sat, 17 Jan 2004 12:32:45 -0800

Brian Foster wrote on 2004-01-14 19:31 UTC:
>   | Continuing characters always begin with binary "10". There is no chance
>   | for an illegal 5 byte sequence to be mistaken for an illegal 4byte
>   | sequence followed by an ascii character.
> 
>  yes there is.  if the illegal 5-byter has the first
>  4-bytes legal followed by an US-ASCII byte (which is
>  what makes the 5-byter illegal), a parser that never
>  considers sequences longer than 4-bytes will see an
>  illegal sequence of 4-bytes and then a valid byte.


No, there is not. A malformed UTF-8 sequence can *never* contain an ASCII
byte, because an ASCII byte always terminates any malformed sequence
that precedes it. Every ASCII byte must resynchronize the decoder and
must then be interpreted correctly as an ASCII character. If your UTF-8
decoder does not resynchronize correctly, you may be in serious security
trouble.
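
To make the resynchronization rule concrete, here is a minimal sketch of
a decoder that follows it. It is an illustration under assumptions, not
a reference implementation: the names decode_utf8_resync and
continuation_count are made up for this example, and the policy of
emitting one U+FFFD per malformed sequence is only one of several
reasonable choices. The point it demonstrates is that a byte below 0x80
is never consumed as part of a multi-byte sequence.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Smallest code point that legitimately needs 1, 2 or 3 continuation bytes.
# Anything smaller encoded at that length is an overlong form.
MIN_CODE_POINT = {1: 0x80, 2: 0x800, 3: 0x10000}


def continuation_count(lead: int):
    """Number of continuation bytes announced by a lead byte, or None."""
    if lead < 0x80:
        return 0
    if 0xC0 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF7:
        return 3
    if 0xF8 <= lead <= 0xFB:
        return 4  # obsolete 5-byte form, recognized only to be rejected
    if 0xFC <= lead <= 0xFD:
        return 5  # obsolete 6-byte form, recognized only to be rejected
    return None  # stray continuation byte (0x80..0xBF) or 0xFE/0xFF


def decode_utf8_resync(data: bytes) -> str:
    """Decode UTF-8, replacing each malformed sequence with U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        need = continuation_count(lead)
        if need == 0:
            out.append(chr(lead))  # ASCII always decodes as itself
            i += 1
            continue
        if need is None:
            out.append(REPLACEMENT)
            i += 1
            continue
        cp = lead & (0x7F >> (need + 1))  # payload bits of the lead byte
        j = i + 1
        # Consume continuation bytes only. Any other byte (in particular
        # any ASCII byte) ends the sequence and is left for the next round.
        while j < len(data) and j - i <= need and (data[j] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j - i) == need + 1
        i = j
        valid = (
            complete
            and need in MIN_CODE_POINT  # rejects 5- and 6-byte forms
            and MIN_CODE_POINT[need] <= cp <= 0x10FFFF  # rejects overlongs
            and not 0xD800 <= cp <= 0xDFFF  # rejects surrogates
        )
        out.append(chr(cp) if valid else REPLACEMENT)
    return "".join(out)
```

With this sketch, decode_utf8_resync(b"\xf8\x88\x80\x80A") yields one
U+FFFD followed by "A": the 5-byte lead and its three continuation bytes
form one malformed sequence, and the ASCII byte that cuts it short is
decoded normally. The overlong form b"\xc0\xaf" yields a single U+FFFD
and never a "/".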

You have demonstrated a quite common (and, from a security point of
view, very dangerous) misunderstanding of how a UTF-8 decoder is
supposed to work.

If you ever write a UTF-8 decoder, please test it thoroughly with

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

which contains all the boundary cases where one might make a mistake
when implementing a UTF-8 decoder.
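
As a complement to such a test file, the property that every ASCII byte
survives decoding can also be checked mechanically. The following is a
rough sketch under assumptions: ascii_survives, find_counterexample and
naive_decode are hypothetical names invented for this example, and
naive_decode is a deliberately broken decoder that trusts the length
announced by the lead byte. Only bytes.decode from the Python standard
library is a real API here. Passing this check does not replace testing
against the boundary cases.

```python
import random


def ascii_survives(decode, data: bytes) -> bool:
    """True if every ASCII byte of data appears, in order, in the output."""
    expected = [chr(b) for b in data if b < 0x80]
    actual = [c for c in decode(data) if ord(c) < 0x80]
    return expected == actual


def builtin_decode(data: bytes) -> str:
    # Python's own decoder, with malformed input replaced by U+FFFD.
    return data.decode("utf-8", errors="replace")


def naive_decode(data: bytes) -> str:
    # Deliberately wrong: skips as many bytes as the lead byte announces,
    # without checking that they really are continuation bytes.
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            out.append("\ufffd")
            i += length
    return "".join(out)


def find_counterexample(decode, trials: int = 2000, seed: int = 0):
    """Return random input on which an ASCII byte is lost, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randrange(1, 9)
        data = bytes(rng.randrange(256) for _ in range(size))
        if not ascii_survives(decode, data):
            return data
    return None
```

For example, ascii_survives(naive_decode, b"\xe2\x82A") is False,
because the naive decoder swallows the "A" as if it were the third byte
of a 3-byte sequence, while the same call with builtin_decode is True.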

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
