On Mon, Aug 19, 2002 at 12:54:24PM -0700, H. Peter Anvin wrote:
> One way is to treat each byte of a malformed sequence as a character
> (different from all real Unicode characters).  This is a mostly good
> approach, except that it allows the user to construct a valid UTF-8
> character out of malformed sequence escapes -- this may or may not be
> a problem in any particular application, but it needs to be taken into
> account, lest we get another instance of the overlong sequence
> problem.
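
For concreteness, here is a minimal sketch of the scheme quoted above. It is not Vim's code. It assumes Python's "surrogateescape" error handler as a stand-in for the per-byte escapes, and the helper names decode_clean and encode_clean are made up for illustration:

```python
# Hypothetical helpers (not Vim's code): Python's "surrogateescape" error
# handler maps each undecodable byte 0xNN to the lone surrogate U+DCNN,
# which is distinct from every real Unicode character.

def decode_clean(raw: bytes) -> str:
    """Decode UTF-8, turning each malformed byte into one escape character."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_clean(text: str) -> bytes:
    """Encode to UTF-8, turning each escape character back into its byte."""
    return text.encode("utf-8", errors="surrogateescape")


c2 = decode_clean(b"\xc2")   # lone lead byte, malformed on its own
a9 = decode_clean(b"\xa9")   # lone continuation byte, malformed on its own

# Each escape is a single character in the buffer.
assert len(c2) == 1 and len(a9) == 1

# Put the two escapes next to each other and write the buffer out.
on_disk = encode_clean(c2 + a9)
assert on_disk == b"\xc2\xa9"          # both bytes written unchanged

# Reading the file back now yields a real character, U+00A9.
assert decode_clean(on_disk) == "\u00a9"

# In the opposite order the bytes are still malformed and stay escaped.
assert decode_clean(encode_clean(a9 + c2)) == a9 + c2
```

The bytes written are exactly the bytes the escapes stood for. Whether the result decodes as a real character depends only on which bytes end up adjacent, which is the case an application has to keep in mind.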

That's what Vim does.  Malformed sequences show up as <HEX>, each of
which functions as a single character.

If the editor is 8-bit-clean and you combine bytes that were parts of
invalid UTF-8 sequences so that they form a valid UTF-8 sequence, then
you have a valid UTF-8 sequence.  If I combine 0xC2 with 0xA9, it had
better write those two bytes to disk, even though together they happen
to encode U+00A9; doing anything else isn't 8-bit-clean.

I tested this, and that's exactly what happens: pasting <C2> in front
of <A9> turns the pair into (C).

What could be done differently?

--
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
