| Date: Tue, 13 Jan 2004 23:02:57 -0500
| From: srintuar <[EMAIL PROTECTED]>
|
| >Consider: parser 1 knows that a UTF-8 sequence can have
| >at most 6 bytes, and sees an illegal 5-byte sequence.
| >
| >Parser 2 knows that a UTF-8 sequence can have at most
| >4 bytes, and sees an illegal 4-byte sequence followed by
| >an ASCII symbol.
| >
| >Difference in interpretation of a byte sequence always has
| >security implications.
|
| Continuing characters always begin with binary "10". There is no chance
| for an illegal 5-byte sequence to be mistaken for an illegal 4-byte
| sequence followed by an ASCII character.
yes there is. if the illegal 5-byte sequence has its first
4 bytes legal, followed by a US-ASCII byte (which is
what makes the 5-byte sequence illegal), a parser that never
considers sequences longer than 4 bytes will see an
illegal 4-byte sequence and then a valid byte.
as is said below, you *must* parse 5- and 6-byte
sequences, even if you then declare the codepoint
encoded by an otherwise-legal sequence invalid for
some reason and throw it away (or whatever).
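
a rough sketch of the disagreement, in Python. everything here is
hypothetical and only for illustration: `seq_len`, `naive_parse` and
`DATA` are made-up names, and the error handling (consume however many
bytes the parser believes the sequence has, then report them as one bad
sequence) is just one policy a naive parser might use. overlong forms
are not checked.

```python
def seq_len(lead):
    """Length implied by a lead byte in the original 1..6 byte scheme, 0 if none."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0  # 10xxxxxx: a continuation byte, not a lead byte
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 0  # 0xFE and 0xFF never appear in UTF-8


def naive_parse(data, max_len):
    """Tokenise bytes into ("ok", codepoint) / ("bad", raw_bytes) tuples.

    Assumed policy: the parser never looks further than max_len bytes,
    and a malformed sequence swallows every byte it was believed to span.
    """
    tokens = []
    i = 0
    while i < len(data):
        implied = seq_len(data[i])
        if implied == 0:
            tokens.append(("bad", data[i:i + 1]))
            i += 1
            continue
        n = min(implied, max_len)
        chunk = data[i:i + n]
        well_formed = (
            implied <= max_len
            and len(chunk) == n
            and all((b & 0xC0) == 0x80 for b in chunk[1:])
        )
        if well_formed:
            cp = chunk[0] if n == 1 else chunk[0] & (0x7F >> n)
            for b in chunk[1:]:
                cp = (cp << 6) | (b & 0x3F)
            tokens.append(("ok", cp))
        else:
            tokens.append(("bad", chunk))
        i += len(chunk)
    return tokens


# 5-byte lead byte, three continuation bytes, then an ASCII "/".
DATA = b"\xf8\x88\x80\x80/"

print(naive_parse(DATA, 6))  # the "/" is swallowed into the bad sequence
print(naive_parse(DATA, 4))  # the "/" survives as a valid character
```

with these assumptions the 6-byte-aware parser reports one bad 5-byte
sequence, while the 4-byte parser reports a bad 4-byte sequence and then
a perfectly good "/". if one of them is the filter and the other is the
consumer, the filter's verdict no longer describes what the consumer sees.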
cheers!
-blf-
| I personally don't think that UTF-8 parsers should bother to enforce the
| limit, and should deal with any valid UTF-8 sequence up to six bytes
| long. (any more than UCS-4 "parsers" should scan all strings over for
| high words)
|
| If someone wants to make a pass over a Unicode string looking to limit
| and validate the ranges of the codepoints used that should be a separate
| consideration. That is the time at which to consider whether all or part
| of the text should be stricken, ignored, refused, cleaned up, or
| otherwise handled.
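
A separate range-checking pass of the kind described above might look
roughly like the following Python sketch. The function name
`out_of_range` and the default limit of 0x10FFFF are assumptions made
for the example, not anything specified in this thread; the pass only
reports offending codepoints and leaves the decision about what to do
with them (strike, ignore, refuse, clean up) to the caller.

```python
def out_of_range(codepoints, limit=0x10FFFF):
    """Return (index, codepoint) pairs that fall outside the accepted range.

    Assumed policy: anything above `limit` is rejected, and so are the
    surrogate codepoints 0xD800..0xDFFF.
    """
    problems = []
    for index, cp in enumerate(codepoints):
        if cp > limit or 0xD800 <= cp <= 0xDFFF:
            problems.append((index, cp))
    return problems


# Already-decoded text: "A", a value past the limit, a surrogate, a euro sign.
decoded = [0x41, 0x110000, 0xD800, 0x20AC]
for index, cp in out_of_range(decoded):
    print(f"codepoint {cp:#x} at position {index} is out of range")
```

Keeping this step apart from decoding means the decoder only has to
answer "where does each sequence start and end", which is the question
the two parsers in the quoted example disagree about.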
--
How many surrealists does it take to | Brian Foster Montpellier,
change a lightbulb? Three. One calms | [EMAIL PROTECTED] France
the warthog, and two fill the bathtub | Stop E$$o (ExxonMobile)!
with brightly-colored machine tools. | http://www.stopesso.com
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/