On Fri, May 25, 2007 at 06:12:13PM +0200, Thomas Wolff wrote:

> I have not heard anything like this before (about changing behaviour 
> of emitted replacement characters)

So far, two competing definitions of UTF-8 have coexisted: one allows
sequences of at most 4 bytes, the other allows up to 6 bytes.

If I correctly understand the thread that's just been discussed on this
list, starting at:
http://mail.nl.linux.org/linux-utf8/2007-04/msg00050.html
then from now on everyone defines UTF-8 to be at most 4 bytes long.
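As a quick illustration of the 4-byte definition (a sketch using standard
Python 3, which implements the RFC 3629 limit): the highest code point,
U+10FFFF, encodes to exactly 4 bytes, and nothing beyond it is representable.

```python
# Under the 4-byte definition (RFC 3629), UTF-8 covers code points
# up to U+10FFFF, and the longest possible sequence is 4 bytes.
encoded = chr(0x10FFFF).encode("utf-8")
print(len(encoded))  # 4 -- the longest sequence modern UTF-8 allows

# Anything above U+10FFFF is simply not a code point any more:
try:
    chr(0x110000)
except ValueError:
    print("no code points above U+10FFFF")
```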

And in that case I think the proper behavior would be to emit 5 or 6
replacement characters, one per byte. Think of it: this is what you would do
if 5- and 6-byte UTF-8 sequences had never been defined.
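This behavior can be observed in practice (a sketch using Python 3, whose
decoder follows the strict 4-byte definition): a former 6-byte sequence with
lead byte 0xFD is no longer valid UTF-8 at all, so each of its 6 bytes is
replaced individually.

```python
# An old-style 6-byte UTF-8 sequence: lead byte 0xFD + five continuation
# bytes. Under the 4-byte definition, 0xFD is not a valid lead byte, so
# every byte of the sequence is treated as an isolated error.
data = b"\xfd\xbf\xbf\xbf\xbf\xbf"
decoded = data.decode("utf-8", errors="replace")
print(decoded.count("\ufffd"))  # 6 -- one replacement character per byte
```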

> Why cannot a long UTF-8 sequence that happens to map to a code point which is 
> not Unicode just be displayed with one replacement character?

I'd perfectly agree with you; I also dislike this 4-byte limitation and
preferred the 6-byte version. But apparently this is not what the Unicode
gurus have decided. That's why I'm asking the gurus here (especially Markus)
what to do now.



-- 
Egmont

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
