On Wed, Jan 14, 2004 at 08:31:16PM +0100, Brian Foster wrote:
> Yes, there is. If the illegal 5-byte sequence has its first
> 4 bytes legal, followed by a US-ASCII byte (which is
> what makes the 5-byte sequence illegal), a parser that never
> considers sequences longer than 4 bytes will see an
> illegal sequence of 4 bytes and then a valid byte.
That would be correct: if a byte that was expected to be a continuation
byte is not one, the sequence read so far should be considered invalid
and the byte that was just read should start a new sequence. A 5-byte
sequence with the fifth byte invalid:
fb bf bf bf 41
should be parsed as an invalid sequence, followed by 0x41 ('A'). (That's
only sensible; on many media, lost bytes are much more common than bit
errors.)
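
A minimal Python sketch of that resynchronization rule, as an
illustration only. The names expected_length and split_sequences are
hypothetical, not taken from any library. It checks only the
lead/continuation byte structure, not overlong forms, surrogates, or code
point range, and it assumes that 5- and 6-byte forms are recognized
structurally (so the parser knows how many continuation bytes to expect)
but always reported as invalid:

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5  # historical 5-byte form
    if 0xFC <= lead <= 0xFD:
        return 6  # historical 6-byte form
    return 0      # stray continuation byte (0x80..0xBF), or 0xFE/0xFF


def split_sequences(data: bytes):
    """Split data into (status, raw_bytes) pairs, status "ok" or "invalid".

    "ok" here means structurally complete only (assumption of this sketch).
    """
    out = []
    i = 0
    while i < len(data):
        need = expected_length(data[i])
        if need == 0:
            out.append(("invalid", data[i:i + 1]))
            i += 1
            continue
        # Consume continuation bytes, stopping at the first byte that is
        # not one; that byte is left in place to start the next sequence.
        j = i + 1
        while j < i + need and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        complete = (j - i == need)
        status = "ok" if complete and need <= 4 else "invalid"
        out.append((status, data[i:j]))
        i = j
    return out


print(split_sequences(bytes.fromhex("fbbfbfbf41")))
```

For the bytes above, this yields one invalid sequence (fb bf bf bf)
followed by a valid 0x41; the 'A' is not swallowed by the truncated
sequence.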
Looking at test 3.3.4 in
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
if it were parsed as you suggest, then the ASCII quote after the
partial sequence would be considered part of the sequence, and would
not be displayed.
--
Glenn Maynard
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/