Hi,

Egmont wrote:
> Now that recently every standard seemed to agree that UTF-8 uses at most 4
> (and not 6) bytes and the highest valid Unicode value is U+1FFFFF, I wonder
U+10FFFF, actually.

> whether the stress test should be updated, too. As far as I understand, the
> preferred new behavior for a former 5 or 6 byte long UTF-8 sequence is to
> emit 5 or 6 replacement characters, since the first byte is invalid, and
> subsequent bytes are unexpected continuation bytes.
I have not heard anything like this before (about changing behaviour 
of emitted replacement characters) and it would be really confusing to 
introduce it. UTF-8 is a simple and straightforward encoding scheme which
happens to cover the full historic 31-bit ISO 10646 range. That many of those
code points are now invalid does not necessarily mean that the interpretation 
of UTF-8 would have to be changed. I don't think it's worth introducing 
this additional headache, especially as it would introduce new inconsistencies 
between older and newer versions of terminals, which we already have plenty of.
Why can't a long UTF-8 sequence that happens to map to a code point outside
Unicode simply be displayed as one replacement character? There is no
good reason for the proposed change; please don't push it forward.
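
To make the behaviour argued for here concrete, below is a minimal sketch in
Python. It is not taken from any terminal's source; the names
sequence_length and decode_historic_utf8 are made up for illustration, and
the handling of truncated and overlong sequences (one replacement character
each) is an assumption of this sketch. It parses the historic 1 to 6 byte
scheme and emits a single U+FFFD for a well-formed sequence whose value is
not a Unicode code point.

```python
REPLACEMENT = "\ufffd"

# Smallest value that legitimately needs n bytes (index = n); used to
# reject overlong encodings. Indices 0 and 1 are unused placeholders.
MIN_VALUE = [0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000]


def sequence_length(lead: int) -> int:
    """Length implied by a lead byte in the historic 1-6 byte scheme, or 0."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0  # stray continuation byte
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 0  # 0xFE and 0xFF never start a sequence


def decode_historic_utf8(data: bytes) -> str:
    """Decode bytes, showing one U+FFFD per undisplayable sequence."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if n == 0:
            out.append(REPLACEMENT)
            i += 1
            continue
        if n == 1:
            out.append(chr(lead))
            i += 1
            continue
        value = lead & (0x7F >> n)  # payload bits of the lead byte
        j = i + 1
        while j < i + n and j < len(data) and (data[j] & 0xC0) == 0x80:
            value = (value << 6) | (data[j] & 0x3F)
            j += 1
        if j < i + n:
            out.append(REPLACEMENT)  # truncated sequence (assumption: one U+FFFD)
        elif value < MIN_VALUE[n]:
            out.append(REPLACEMENT)  # overlong form (assumption: one U+FFFD)
        elif value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            out.append(REPLACEMENT)  # well-formed, but not a Unicode scalar value
        else:
            out.append(chr(value))
        i = j  # resume after the bytes consumed by this sequence
    return "".join(out)
```

With this sketch, the 5-byte sequence F8 88 80 80 80 (value 0x200000) comes
out as one replacement character rather than five.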

Kind regards,
Thomas

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
