On Mon, Aug 19, 2002 at 12:54:24PM -0700, H. Peter Anvin wrote:
> One way is to treat each byte of a malformed sequence as a character
> (different from all real Unicode characters).  This is a mostly good
> approach, except that it allows the user to construct a valid UTF-8
> character out of malformed sequence escapes -- this may or may not be
> a problem in any particular application, but it needs to be taken into
> account, lest we get another instance of the overlong sequence
> problem.

That's what Vim does.  Malformed sequences show up as <HEX>, which
functions as a single character.
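
As a rough illustration of the per-byte escape idea (a Python sketch, not
Vim's actual code): the standard "surrogateescape" error handler maps each
undecodable byte to a code point in U+DC80..U+DCFF, and the hypothetical
render() helper below shows those as <HEX>, one display unit per byte.

```python
def render(raw: bytes) -> str:
    """Show malformed UTF-8 bytes as <HEX>; a sketch, not Vim's code."""
    # surrogateescape turns each undecodable byte 0xNN into U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # An escaped byte: recover its value and display it as <HEX>.
            out.append("<%02X>" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


print(render(b"a\xa9b"))    # lone continuation byte is escaped
print(render(b"\xc2\xa9"))  # well-formed pair decodes normally
```

Each escaped byte stays distinct from every real Unicode character, which
is the property the quoted approach asks for.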

If the editor is 8-bit-clean and you combine bytes that were parts of
invalid UTF-8 sequences so that they form a valid UTF-8 sequence, then
you have a valid UTF-8 sequence. If I combine 0xC2 with 0xA9, it had
better write those two bytes to disk, even though together they happen
to encode U+00A9; doing anything else isn't 8-bit-clean.

I tested this, and that's exactly what happens; pasting <A9> right after
<C2> turns the pair into (C).
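
The same effect can be reproduced outside an editor. The sketch below
assumes Python's "surrogateescape" handler as a stand-in for an
8-bit-clean buffer; join_and_write() is a made-up helper name, not
anything Vim provides.

```python
def join_and_write(fragments):
    """Concatenate separately decoded fragments and return the bytes
    an 8-bit-clean writer would put on disk (illustrative only)."""
    # Decode each fragment on its own, so a lone 0xC2 or 0xA9 is
    # malformed and gets escaped rather than decoded.
    pieces = [f.decode("utf-8", errors="surrogateescape") for f in fragments]
    joined = "".join(pieces)
    # Encoding with the same handler restores the original bytes verbatim.
    return joined.encode("utf-8", errors="surrogateescape")


on_disk = join_and_write([b"\xc2", b"\xa9"])
print(on_disk)                  # the two raw bytes, unchanged
print(on_disk.decode("utf-8"))  # strict decoding now succeeds: U+00A9
```

Neither fragment is valid on its own, yet the bytes written out form a
well-formed encoding of U+00A9, which is the case the quoted message
warns about.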

What could be done differently?

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
