If I correctly understand the thread that's just been discussed on this
list, starting at:
http://mail.nl.linux.org/linux-utf8/2007-04/msg00050.html
then from now on everyone defines UTF-8 to be at most 4 bytes long.

And in this case I think the proper behavior would be to emit 5 or 6 bytes.
Think of it: this is what you would do if 5- and 6-byte UTF-8 had never been
defined.


If the text was originally in UTF-32 with an invalid high codepoint,
that would result in a single substitution character.  It makes sense
for UTF-8 to behave similarly... perhaps.
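
To make the two options concrete, here is a rough Python sketch of a
decoder that can follow either policy. The function name
decode_utf8_sketch and the collapse_long_forms flag are made up for
illustration; this is not how any particular library behaves, only one
way the choice could be expressed.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_utf8_sketch(data: bytes, collapse_long_forms: bool) -> str:
    """Decode UTF-8, substituting U+FFFD for anything invalid.

    Hypothetical policy switch: if collapse_long_forms is True, a complete
    obsolete 5- or 6-byte sequence yields ONE U+FFFD (like a single bad
    UTF-32 code unit would); otherwise each of its bytes yields one.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # plain ASCII
            out.append(chr(b))
            i += 1
            continue

        # Sequence length implied by the lead byte.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        elif 0xF8 <= b <= 0xFB:
            length = 5  # obsolete 5-byte form
        elif 0xFC <= b <= 0xFD:
            length = 6  # obsolete 6-byte form
        else:
            length = 0  # stray continuation byte, C0/C1, F5-F7, FE/FF

        seq = data[i:i + length]
        complete = (
            length > 0
            and len(seq) == length
            and all(0x80 <= c <= 0xBF for c in seq[1:])
        )

        if complete and length <= 4:
            try:
                # Let the strict built-in decoder reject overlong forms,
                # surrogates, and values above U+10FFFF.
                out.append(seq.decode("utf-8"))
                i += length
                continue
            except UnicodeDecodeError:
                complete = False

        if complete and collapse_long_forms:
            # A whole 5- or 6-byte sequence becomes one substitution.
            out.append(REPLACEMENT)
            i += length
        else:
            # Substitute this byte only and resynchronize on the next one.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)


five_byte = b"a" + bytes([0xF8, 0x88, 0x80, 0x80, 0x80]) + b"b"
one_sub = decode_utf8_sketch(five_byte, collapse_long_forms=True)
per_byte = decode_utf8_sketch(five_byte, collapse_long_forms=False)
```

With the flag set, the 5-byte sequence between "a" and "b" turns into a
single U+FFFD; with it cleared, the same input gives five of them, one
per byte.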

That said, I don't care for replacement character techniques very much
in general.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
