Egmont wrote:

> On Fri, May 25, 2007 at 06:12:13PM +0200, Thomas Wolff wrote:

> > I have not heard anything like this before (about changing behaviour 
> > of emitted replacement characters)

> So far there lived two concurrent definitions of UTF-8, one defined it to be
> at most 4 bytes long, while the other one defined 6 bytes.

> If I correctly understand the thread that's just been discussed on this
> list, starting at:
> http://mail.nl.linux.org/linux-utf8/2007-04/msg00050.html
> then from now on everyone defines UTF-8 to be at most 4 bytes long.
The paper mentioned there is only a discussion paper - some personal opinion.
It starts:
"Doc Type       Working Group Document
 Title  Synchronization Issues for UTF-8
 Source Ken Whistler
 Status Individual Contribution"

And I don't think there can be "concurrent definitions" of UTF-8.

UTF-8 is clearly defined by RFC 2279, which maintains the clear 
1-to-6-byte encoding scheme of RFC 2044 with no confusion, and will 
hopefully remain so.
This definition is not affected by the question of whether any of 
the encoded code points is "invalid" or anything else.
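
To make the 1-to-6-byte scheme concrete, here is a minimal sketch of an
encoder that follows the bit layout table in RFC 2279. The function name
encode_rfc2279 is my own and purely illustrative; it is not taken from any
library, and it deliberately does not care whether the value is an assigned
or "valid" code point.

```python
def encode_rfc2279(cp: int) -> bytes:
    """Encode one value with the 1-to-6-byte scheme described in RFC 2279.

    Illustrative sketch only: any value up to 0x7FFFFFFF is accepted,
    whether or not it is an assigned or valid Unicode code point.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # 1 byte: 0xxxxxxx
    # (exclusive upper limit, sequence length, lead-byte marker bits)
    for limit, length, marker in (
        (0x800, 2, 0xC0),       # 110xxxxx
        (0x10000, 3, 0xE0),     # 1110xxxx
        (0x200000, 4, 0xF0),    # 11110xxx
        (0x4000000, 5, 0xF8),   # 111110xx
        (0x80000000, 6, 0xFC),  # 1111110x
    ):
        if cp < limit:
            break
    # Peel off six payload bits per continuation byte, lowest bits first.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx
        cp >>= 6
    return bytes([marker | cp] + tail[::-1])
```

Values up to U+10FFFF come out identical to what a 4-byte-only encoder
produces; only values from 0x200000 upwards need the 5- and 6-byte forms.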


> And in this case I think the proper behavior would be to emit 5 or 6 bytes.
I don't think so, but that doesn't matter, as this case doesn't arise anyway :)
> Think of it: this is what you would do if 5 and 6 byte UTF-8 wasn't ever
> defined.
It was and it is defined, so no confusion needed.

> > Why cannot a long UTF-8 sequence that happens to map to a code point which 
> > is not Unicode just be displayed with one replacement character?
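
The behaviour asked about here can be sketched roughly as follows: a
structurally complete sequence of up to 6 bytes is consumed as one unit, and
if its value lies beyond U+10FFFF a single U+FFFD is emitted for the whole
sequence. The names sequence_length and decode_lenient are hypothetical, and
the handling of malformed input (one U+FFFD per offending byte) is only an
assumption for the sake of the example, not what any particular terminal does.

```python
REPLACEMENT = "\ufffd"
# Smallest value that legitimately needs a sequence of the given length.
MIN_VALUE = (0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000)


def sequence_length(lead: int) -> int:
    """Length implied by a lead byte in the 1-to-6-byte scheme, 0 if none."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # stray continuation byte, or 0xFE / 0xFF


def decode_lenient(data: bytes) -> str:
    """Decode, emitting one U+FFFD per complete sequence beyond U+10FFFF."""
    out = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        malformed = (
            n == 0
            or len(chunk) < n
            or any(b & 0xC0 != 0x80 for b in chunk[1:])
        )
        if malformed:
            # Assumed policy: one replacement per offending byte.
            out.append(REPLACEMENT)
            i += 1
            continue
        if n == 1:
            cp = chunk[0]
        else:
            cp = chunk[0] & (0x7F >> n)  # payload bits of the lead byte
            for b in chunk[1:]:
                cp = (cp << 6) | (b & 0x3F)
        overlong = cp < MIN_VALUE[n]
        surrogate = 0xD800 <= cp <= 0xDFFF
        if cp > 0x10FFFF or overlong or surrogate:
            out.append(REPLACEMENT)  # one character for the whole sequence
        else:
            out.append(chr(cp))
        i += n
    return "".join(out)
```

For example, the 5-byte sequence F8 88 80 80 80 between two ASCII letters
decodes to the first letter, one replacement character, and the second letter.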

> I'd perfectly agree with you, I also dislike this 4-byte limitation and
> preferred the 6-byte version. But apparently this is not what Unicode-gurus
> have decided to have. That's why I'm asking gurus here (especially Markus)
> what to do now.

Thanks for agreeing with me, so let's stick to the formal definition which 
does not call for any change.
I hope Ken Whistler is reading this. Does he have a mail address?

Thomas

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/