On Wed, Jan 14, 2004 at 08:31:16PM +0100, Brian Foster wrote:
>  yes there is.  if the illegal 5-byter has the first
>  4-bytes legal followed by an US-ASCII byte (which is
>  what makes the 5-byter illegal), a parser that never
>  considers sequences longer than 4-bytes will see an
>  illegal sequence of 4-bytes and then a valid byte.

That would be correct: if a byte that was expected to be a continuation
byte is not one, the sequence read so far should be considered invalid,
and the byte that was just read should start a new sequence.  A 5-byte
sequence whose fifth byte is not a continuation byte:

fb bf bf bf 41

should be parsed as an invalid sequence, followed by 0x41 ('A').  (That's
only sensible; on many media, lost bytes are much more common than bit
errors.)
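
A rough sketch of that resynchronisation rule, in Python.  This is only an
illustration under my own assumptions, not code from any real decoder: the
names expected_length and decode_resync are made up, 5- and 6-byte forms
are always reported as invalid, and each invalid sequence becomes a single
U+FFFD.

```python
REPLACEMENT = "\ufffd"

# Smallest code point each length may encode; anything lower is overlong.
MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000}


def expected_length(lead):
    """Length announced by a lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # stray continuation byte, 0xFE or 0xFF


def decode_resync(data):
    """Decode bytes, restarting at the first byte that breaks a sequence."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = expected_length(lead)
        if n == 1:
            out.append(chr(lead))
            i += 1
            continue
        if n == 0:
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume continuation bytes, but stop at the first byte that is
        # not one; that byte is left in place to start the next sequence.
        j = i + 1
        while j < min(i + n, len(data)) and 0x80 <= data[j] <= 0xBF:
            j += 1
        if j - i == n and n <= 4:
            cp = lead & (0xFF >> (n + 1))
            for b in data[i + 1:j]:
                cp = (cp << 6) | (b & 0x3F)
            valid = (MIN_VALUE[n] <= cp <= 0x10FFFF
                     and not 0xD800 <= cp <= 0xDFFF)
            out.append(chr(cp) if valid else REPLACEMENT)
        else:
            # Truncated sequence, or a 5-/6-byte form: one invalid sequence.
            out.append(REPLACEMENT)
        i = j
    return "".join(out)


print(repr(decode_resync(bytes.fromhex("fbbfbfbf41"))))  # '\ufffdA'
```

The 0x41 is never consumed as part of the broken sequence, so it comes
out as 'A' right after the replacement character.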

Look at test 3.3.4 in

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

If it were parsed as you suggest, the ASCII quote after the partial
sequence would be considered part of the sequence, and not displayed.
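
To make the failure concrete, here is a deliberately naive sketch that
trusts the lead byte and skips the announced number of bytes without
checking them.  The function name naive_skip_decode and the input bytes
are my own; the input only mirrors the shape of such a test line (quote,
truncated 5-byte sequence, quote) and is not copied from the file.  For
brevity it reports every non-ASCII sequence as U+FFFD.

```python
def naive_skip_decode(data):
    """Hypothetical naive parser: skips as many bytes as the lead announces."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        # Count the leading 1 bits of the lead byte to get the announced
        # length (a stray continuation byte counts as length 1).
        n = 0
        while n < 8 and lead & (0x80 >> n):
            n += 1
        out.append("\ufffd")
        i += n  # blindly skip, even over bytes that are not continuations
    return "".join(out)


# Opening quote, 5-byte lead with only three continuation bytes, closing
# quote, then " |".  The closing quote is swallowed by the blind skip.
print(repr(naive_skip_decode(b'"\xfb\xbf\xbf\xbf" |')))  # '"\ufffd |'
```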

-- 
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
