Re: Substituting malformed UTF-8 sequences in a decoder

Florian Weimer Tue, 25 Jul 2000 01:46:24 -0700
  Edmund GRIMLEY EVANS <[EMAIL PROTECTED]> writes:

> > B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence
> 
> This is what I do in Mutt. It's easy to implement and works for any
> multibyte encoding; the program doesn't have to know about UTF-8.

This is what I recommend at the moment, with two exceptions.  For
UTF-8-to-UTF-16 translation, a UCS-4 character which can't be
represented in UTF-16 is replaced with a single replacement character.
The same applies to syntactically correct UTF-8 sequences which are
either overlong or encode code positions, such as surrogates, which
are forbidden in UTF-8.
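
To make the rule concrete, here is a rough sketch of a decoder along
these lines.  It is only an illustration, not code from any existing
library: the names decode_replace, sequence_length and MIN_FOR_LENGTH
are made up, and it assumes the original six-byte UTF-8 syntax and a
UTF-16 target (so anything above U+10FFFF counts as unrepresentable).

```python
REPLACEMENT = 0xFFFD

# Smallest code point that legitimately needs a sequence of each length
# (assumption: original 31-bit UTF-8 syntax, up to six bytes).
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def sequence_length(lead):
    """Return the sequence length announced by a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # stray continuation byte (0x80-0xBF) or 0xFE/0xFF


def decode_replace(data):
    """Decode bytes into a list of code points suitable for UTF-16 output."""
    out = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        tail = data[i + 1:i + n]
        if n == 0 or len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            # Malformed: one U+FFFD for this byte, then retry at the next byte,
            # so every byte of the bad sequence yields its own U+FFFD.
            out.append(REPLACEMENT)
            i += 1
            continue
        if n == 1:
            cp = data[i]
        else:
            cp = data[i] & (0x7F >> n)  # payload bits of the lead byte
            for b in tail:
                cp = (cp << 6) | (b & 0x3F)
        overlong = cp < MIN_FOR_LENGTH[n]
        surrogate = 0xD800 <= cp <= 0xDFFF
        beyond_utf16 = cp > 0x10FFFF
        if overlong or surrogate or beyond_utf16:
            # Syntactically correct but not acceptable: a single U+FFFD
            # for the whole sequence.
            out.append(REPLACEMENT)
        else:
            out.append(cp)
        i += n
    return out
```

With this sketch, the truncated sequence E2 82 followed by "A" gives
two replacement characters and then "A", while the overlong sequence
C0 80 gives just one.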

> > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> >    UTF-8 sequence
> 
> Not much good if you're not converting to UTF-16.

Well, it works with UCS-4 as well (but I would use a private-use area
for this kind of stuff until it's generally accepted practice to do
such hacks with surrogates).

I think D) could be yet another translation method (in addition to
"error" and "replace"), but it shouldn't be the only one a UTF-8
library provides.  With method D), your UTF-8 *encoder* might create
an invalid UTF-8 stream, which is certainly not desirable for some
applications.
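
As an illustration of why the encoder side is a concern: Python's
"surrogateescape" error handler (a later facility, used here only as a
convenient stand-in for method D, not something this thread refers to)
maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF.  The
variable names below are made up for the example.

```python
raw = b"ok \xff\xfe end"  # 0xFF and 0xFE never occur in valid UTF-8

# Decoder side: each malformed byte becomes one lone low surrogate.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoder side with the same method: the original bytes come back,
# which means the encoder has just produced an invalid UTF-8 stream.
round_trip = text.encode("utf-8", errors="surrogateescape")

# A strict encoder refuses the lone surrogates instead.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A strict decoder rejects what the lenient encoder wrote.
try:
    round_trip.decode("utf-8")
    output_is_valid_utf8 = True
except UnicodeDecodeError:
    output_is_valid_utf8 = False
```

The round trip is lossless, which is the attraction, but whoever reads
the output gets the same malformed bytes that went in.  That is why it
makes sense as an optional method next to "error" and "replace" rather
than the default.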

> It's unfortunate that the current UTF-8 stuff for Emacs causes
> malformed UTF-8 files to be silently trashed.

Yes, that's quite annoying.  But the whole MULE stuff is a big
mess.  In-band signalling everywhere. :-( (Some byte sequences in a
single-byte buffer do very strange things.)
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/