H. Peter Anvin <[EMAIL PROTECTED]>:
> > But which is that? The one described in RFC 2279, the one in ISO
> > 10646-1:2000, or the one in Unicode 3.1? These are different.
> >
>
> The only difference is how permissive the standard is with respect to
> the handling of irregular sequences. No standard has ever required
> interpretation of irregular sequences (except perhaps as a
> specification bug), and the only safe answer has always been to reject
> them.
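A concrete illustration of the rejection H. Peter Anvin describes: the classic irregular form is an overlong sequence such as C0 80, which decodes to U+0000 but is not the shortest encoding of that value. The following is a hedged sketch of a strict single-sequence decoder (the function name and structure are mine, not from any particular library) that rejects such forms:

```python
def strict_utf8_decode_one(b):
    """Decode one UTF-8 sequence from the front of bytes b.

    Rejects overlong ("irregular") forms by checking that the decoded
    value actually needed as many bytes as were used.  Returns
    (codepoint, bytes_consumed).  Illustrative sketch only.
    """
    first = b[0]
    if first < 0x80:                      # 1-byte ASCII form
        return first, 1
    # (length, lead-byte range, payload mask, minimum legal value)
    for nbytes, lo, hi, mask, minval in (
            (2, 0xC0, 0xDF, 0x1F, 0x80),
            (3, 0xE0, 0xEF, 0x0F, 0x800),
            (4, 0xF0, 0xF7, 0x07, 0x10000),
            (5, 0xF8, 0xFB, 0x03, 0x200000),
            (6, 0xFC, 0xFD, 0x01, 0x4000000)):
        if lo <= first <= hi:
            if len(b) < nbytes:
                raise ValueError("truncated sequence")
            cp = first & mask
            for byte in b[1:nbytes]:
                if byte & 0xC0 != 0x80:
                    raise ValueError("bad continuation byte")
                cp = (cp << 6) | (byte & 0x3F)
            if cp < minval:
                raise ValueError("irregular (overlong) sequence")
            return cp, nbytes
    raise ValueError("invalid lead byte")
```

A strict decoder like this accepts E2 82 AC (U+20AC) but raises on C0 80, which is exactly the "only safe answer" behaviour.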
But sometimes it is not possible to reject sequences; you have to do
something with the data, even if that means replacing it by '?'s. So
in some circumstances it might be better to accept and generate UTF-8
sequences corresponding to all of the integers from 0 to 2^31-1. That
is, after all, the simplest and most logical behaviour, and it would be
the standard behaviour if there were no endian and UTF-16 problems.
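To make the "all integers from 0 to 2^31-1" point concrete, here is a sketch of an encoder following the original FSS-UTF / RFC 2279 scheme, which allows sequences up to 6 bytes (the function name is mine; later standards such as Unicode 3.1 restrict the range and forbid surrogates, which this sketch deliberately ignores):

```python
def utf8_encode_31bit(cp):
    """Encode any integer 0..2^31-1 as RFC 2279-style UTF-8.

    Uses the original rule: 1 byte below 0x80, then 2..6 byte
    sequences with lead-byte markers C0/E0/F0/F8/FC.  Sketch only.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("out of 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # (sequence length, exclusive upper bound, lead-byte marker)
    for nbytes, limit, marker in ((2, 0x800, 0xC0), (3, 0x10000, 0xE0),
                                  (4, 0x200000, 0xF0), (5, 0x4000000, 0xF8),
                                  (6, 0x80000000, 0xFC)):
        if cp < limit:
            out = []
            for _ in range(nbytes - 1):      # low 6 bits per trail byte
                out.append(0x80 | (cp & 0x3F))
                cp >>= 6
            out.append(marker | cp)          # remaining bits in lead byte
            return bytes(reversed(out))
```

Under this scheme U+D800 encodes happily to ED A0 80 and 2^31-1 to FD BF BF BF BF BF, which is exactly the permissive behaviour the paragraph above argues for.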
It sort of irritates me that in a UCS-4/UTF-8 world we are expected to
treat U+D800..U+DFFF and U+FFFE and U+FFFF as illegal.
Edmund
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/