Re: Linux console UTF-8 by default

Brian Foster Wed, 14 Jan 2004 12:43:04 -0800

  | Date: Tue, 13 Jan 2004 23:02:57 -0500
  | From: srintuar <[EMAIL PROTECTED]>
  | 
  | >Consider: parser 1 knows that a UTF-8 sequence can have
  | >at most 6 bytes, and sees an illegal 5-byte sequence.
  | >
  | >Parser 2 knows that a UTF-8 sequence can have at most
  | >4 bytes, and sees an illegal 4-byte sequence followed by
  | >an ASCII symbol.
  | >
  | >Difference in interpretation of a byte sequence always has
  | >security implications.
  | 
  | Continuation bytes always begin with binary "10". There is no chance
  | for an illegal 5-byte sequence to be mistaken for an illegal 4-byte
  | sequence followed by an ASCII character.


 yes there is.  if the illegal 5-byter has its first
 4 bytes legal, followed by a US-ASCII byte (which is
 what makes the 5-byter illegal), a parser that never
 considers sequences longer than 4 bytes will see an
 illegal sequence of 4 bytes and then a valid byte.
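
 The divergence can be reproduced with a deliberately careless decoder.
 The sketch below is only an illustration: the function names, the
 max_len parameter and the error handling (one U+FFFD per illegal
 sequence) are assumptions, not the behaviour of any particular real
 parser. With max_len=6 it trusts the length announced by the lead byte
 and swallows that many bytes; with max_len=4 it does not recognise the
 0xF8 lead byte at all and skips only the continuation bytes after it.

```python
REPLACEMENT = "\ufffd"

def lead_length(byte):
    """Length announced by a lead byte in the original 6-byte UTF-8
    scheme; 0 for continuation bytes and for 0xFE/0xFF."""
    ones = 0
    while ones < 8 and byte & (0x80 >> ones):
        ones += 1
    if ones == 0:
        return 1                      # 0xxxxxxx: US-ASCII
    return ones if 2 <= ones <= 6 else 0

def is_cont(byte):
    return (byte & 0xC0) == 0x80      # 10xxxxxx

def naive_decode(data, max_len):
    """Hypothetical careless decoder; overlong forms are not checked."""
    out, i = [], 0
    while i < len(data):
        n = lead_length(data[i])
        if n == 1:
            out.append(chr(data[i]))
            i += 1
        elif n == 0 or n > max_len:
            # Lead byte unknown to this parser: skip it together with
            # the continuation bytes that follow, as one illegal sequence.
            i += 1
            while i < len(data) and is_cont(data[i]):
                i += 1
            out.append(REPLACEMENT)
        else:
            chunk = data[i:i + n]
            if len(chunk) == n and all(is_cont(b) for b in chunk[1:]):
                cp = chunk[0] & (0x7F >> n)
                for b in chunk[1:]:
                    cp = (cp << 6) | (b & 0x3F)
                out.append(chr(cp) if cp <= 0x10FFFF else REPLACEMENT)
            else:
                out.append(REPLACEMENT)
            i += n                    # careless: consumes n bytes regardless
    return "".join(out)

# 0xF8 announces 5 bytes; the fifth byte is US-ASCII 'A' instead of 10xxxxxx.
data = b"\xf8\x88\x80\x80A"
six_byte_view = naive_decode(data, max_len=6)
four_byte_view = naive_decode(data, max_len=4)
print(repr(six_byte_view), repr(four_byte_view))
```

 Under these assumptions the 6-byte parser loses the 'A' inside the
 illegal sequence while the 4-byte parser keeps it, so the two disagree
 about whether an ASCII character is present in the input.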

 as is said below, you *must* parse sequences of 5
 and 6 bytes, even if you then declare the codepoint
 encoded by an otherwise-legal sequence invalid for
 some reason and throw it away (or whatever).
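
 One way to follow that advice is sketched below. It is a minimal
 illustration under assumptions, not a reference decoder: the name
 decode_then_reject, the MIN_FOR_LEN table and the choice of emitting
 U+FFFD for each rejected sequence are all hypothetical. It parses the
 structure of sequences up to 6 bytes long, stops at the first byte
 that is not a continuation byte, and only afterwards decides whether
 the decoded codepoint is acceptable.

```python
REPLACEMENT = "\ufffd"

# Smallest codepoint that legitimately needs a sequence of each length;
# anything lower is an overlong form.
MIN_FOR_LEN = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}

def decode_then_reject(data):
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                       # US-ASCII
            out.append(chr(b))
            i += 1
            continue
        n = 0                              # count the leading 1 bits
        while n < 8 and b & (0x80 >> n):
            n += 1
        if not 2 <= n <= 6:                # stray continuation, 0xFE or 0xFF
            out.append(REPLACEMENT)
            i += 1
            continue
        cp = b & (0x7F >> n)
        j = i + 1
        while j < i + n and j < len(data) and (data[j] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j < i + n:
            # Truncated sequence: the offending byte is NOT consumed,
            # so it is decoded on its own in the next iteration.
            out.append(REPLACEMENT)
        elif cp < MIN_FOR_LEN[n] or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            # Well-formed structure, but the codepoint is thrown away.
            out.append(REPLACEMENT)
        else:
            out.append(chr(cp))
        i = j
    return "".join(out)

# A structurally complete 5-byte sequence (0x200000) followed by 'A'.
print(repr(decode_then_reject(b"\xf8\x88\x80\x80\x80A")))
# The same lead byte with 'A' in place of the last continuation byte.
print(repr(decode_then_reject(b"\xf8\x88\x80\x80A")))
```

 In both cases the 'A' survives as its own character, because the
 decoder understands how long a 5-byte sequence is supposed to be and
 resynchronises at the first byte that cannot belong to it.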

cheers!
        -blf-

  | I personally don't think that UTF-8 parsers should bother to enforce the
  | limit, and should deal with any valid UTF-8 sequence up to six bytes
  | long. (any more than UCS-4 "parsers" should scan all strings over for
  | high words)
  | 
  | If someone wants to make a pass over a Unicode string looking to limit
  | and validate the ranges of the codepoints used, that should be a separate
  | consideration. That is the time at which to consider whether all or part
  | of the text should be stricken, ignored, refused, cleaned up, or
  | otherwise handled.
--
«How many surrealists does it take to    |  Brian Foster      Montpellier,
 change a lightbulb?  Three.  One calms  |  [EMAIL PROTECTED]      France
 the warthog, and two fill the bathtub   |    Stop E$$o (ExxonMobile)!
 with brightly-colored machine tools.»   |        http://www.stopesso.com

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
