Hi,

Egmont wrote:
> Now that recently every standard seemed to agree that UTF-8 uses at most 4
> (and not 6) bytes and the highest valid Unicode value is U+1FFFFF, I wonder

U+10FFFF, actually.

> whether the stress test should be updated, too. As far as I understand, the
> preferred new behavior for a former 5 or 6 byte long UTF-8 sequence is to
> emit 5 or 6 replacement character, since the first byte is invalid, and
> subsequent bytes are unexpected continuation bytes.

I have not heard anything like this before (about changing behaviour of
emitted replacement characters) and it would be really confusing to
introduce it.

UTF-8 is a simple and straight-forward encoding scheme which happens to
cover full historic 31-bit ISO 10646. That many of those code points are
now invalid does not necessarily mean that the interpretation of UTF-8
would have to be changed. I don't think it's worth introducing this
additional headache, especially as it would introduce new inconsistencies
between older and newer versions of terminals, which we already have
plenty of.

Why cannot a long UTF-8 sequence that happens to map to a code point which
is not Unicode just be displayed with one replacement character? There is
no good reason for this, please don't push it forward.

Kind regards,
Thomas

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
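
P.S. To make the two behaviours discussed above concrete, here is a rough
sketch in Python. It is not taken from any terminal's source; the names
decode_utf8, seq_length and the legacy flag are made up for illustration.
With legacy=True the decoder parses the historic 31-bit scheme and emits one
replacement character for a complete sequence whose value is not a Unicode
code point. With legacy=False it treats a 5- or 6-byte lead byte as invalid,
so every following continuation byte is unexpected and gets its own
replacement character.

```python
REPLACEMENT = "\uFFFD"

# Smallest code point that legitimately needs n bytes (rejects overlong forms).
MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def seq_length(lead):
    """Sequence length implied by a lead byte in the 31-bit scheme, else 0."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # stray continuation byte (0x80..0xBF) or 0xFE / 0xFF


def decode_utf8(data, legacy=True):
    """Hypothetical decoder; legacy=True accepts 5/6-byte framing as one unit."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = seq_length(lead)
        if n == 0 or (not legacy and n > 4):
            # Invalid lead byte: one replacement character for this byte only.
            out.append(REPLACEMENT)
            i += 1
            continue
        if n == 1:
            out.append(chr(lead))
            i += 1
            continue
        cp = lead & (0x7F >> n)  # payload bits of the lead byte
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        truncated = (j - i) < n
        not_unicode = (
            cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF or cp < MIN_FOR_LENGTH[n]
        )
        # Whatever went wrong, the bytes consumed so far yield ONE replacement.
        out.append(REPLACEMENT if truncated or not_unicode else chr(cp))
        i = j
    return "".join(out)


five = b"A\xF8\x88\x80\x80\x80B"  # 5-byte form of the value 0x200000
print(len(decode_utf8(five, legacy=True)))   # one replacement between A and B
print(len(decode_utf8(five, legacy=False)))  # five replacements between A and B
```

This is deliberately simplified: with legacy=False it only changes the
handling of 5- and 6-byte lead bytes, and it does not attempt to reproduce
every rule of a current conformant decoder (for example the treatment of
lead bytes 0xC0, 0xC1 and 0xF5..0xF7).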
