Re: UTF-8 as the single common encoding everywhere

H. Peter Anvin Wed, 06 Jun 2001 16:42:46 -0700
Followup to:  <[EMAIL PROTECTED]>
By author:    Roozbeh Pournader <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
>
> On 6 Jun 2001, H. Peter Anvin wrote:
> 
> > There is only one UTF-8.
> 
> But which is that? The one described in RFC 2279, the one in ISO
> 10646-1:2000, or the one in Unicode 3.1? These are different.
> 

The only difference is how permissive each standard is with respect to
the handling of irregular sequences.  No standard has ever required
interpretation of irregular sequences (except perhaps as a
specification bug), and the only safe answer has always been to reject
them.
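
A minimal sketch of the "reject them" policy, assuming the 1..6 byte
sequence lengths of RFC 2279.  The names MIN_VALUE, sequence_length and
decode_strict are made up for illustration and come from no standard or
library; the only point is that a value smaller than the minimum for
its sequence length (an overlong form) is refused, not interpreted.

```python
# Smallest value that legitimately needs a sequence of each length
# (RFC 2279 ranges).  A decoded value below this is an overlong form.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def sequence_length(lead):
    """Return the sequence length implied by a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0  # stray continuation byte
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 0  # 0xFE and 0xFF never appear in UTF-8


def decode_strict(data):
    """Decode bytes to a list of UCS-4 integers; raise ValueError on bad input."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if n == 0:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Payload bits of the lead byte: all 7 for ASCII, fewer otherwise.
        value = lead if n == 1 else lead & (0x7F >> n)
        for b in data[i + 1:i + n]:
            if b & 0xC0 != 0x80:
                raise ValueError("bad continuation byte in sequence at offset %d" % i)
            value = (value << 6) | (b & 0x3F)
        if value < MIN_VALUE[n]:
            raise ValueError("overlong sequence at offset %d" % i)
        out.append(value)
        i += n
    return out
```

With this, decode_strict(b"A") yields [0x41], while b"\xc1\x81" (a
two-byte spelling of the same "A") raises ValueError instead of
aliasing to it.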

The recent "we have this UTF-16 thing and let's break everything else
to cater to it" trend really is irrelevant for that purpose; it
doesn't affect UCS-4 <-> UTF-8 conversion at all, since there is no
aliasing, and it's pretty much up to the application if it wants to
treat a UCS-4 value of 0x60000000 [for example] as an error sentinel
or as an undefined character -- it has the same effect either way.  No
aliasing of invalid characters to valid characters.
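
As an illustration only, here is a sketch of the UCS-4 -> UTF-8
direction using the 31-bit form described in RFC 2279.  encode_ucs4 is
a hypothetical helper, not a library function.  It shows that a value
such as 0x60000000 gets exactly one byte sequence of its own, so
whether the application then treats it as a sentinel or as an undefined
character is a separate decision.

```python
# (exclusive upper bound, lead-byte marker) for 2..6 byte sequences,
# following the ranges in RFC 2279.
RANGES = [
    (0x800, 0xC0),
    (0x10000, 0xE0),
    (0x200000, 0xF0),
    (0x4000000, 0xF8),
    (0x80000000, 0xFC),
]


def encode_ucs4(value):
    """Encode one UCS-4 value (0..0x7FFFFFFF) as a UTF-8 byte string."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 range")
    if value < 0x80:
        return bytes([value])
    # Pick the shortest sequence length that can hold the value.
    n = 2
    for limit, marker in RANGES:
        if value < limit:
            break
        n += 1
    lead = marker | (value >> (6 * (n - 1)))
    # Remaining bits go into continuation bytes, six at a time.
    tail = [0x80 | ((value >> (6 * k)) & 0x3F) for k in range(n - 2, -1, -1)]
    return bytes([lead] + tail)


print(encode_ucs4(0x60000000).hex())  # fda080808080
```

Because the encoder always picks the shortest length, distinct values
never share a byte sequence.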

The one thing that should *NOT* be done, and which the standards will
not tell you because of politics (read: software vendors would have to
upgrade their buggy software), is UTF-16-in-UTF-8 conversion, i.e.
encoding each UTF-16 surrogate as its own UTF-8 sequence -- such code
points are bogus.  Period.  The assertion that "it's not in the BMP,
it's not important" is broken -- a lot of issues occur because of
aliasing in general, not because of any particular sequencing.
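
To make the aliasing concrete, here is a small sketch comparing the
proper encoding of U+10000 with the UTF-16-in-UTF-8 form, plus a
hypothetical checker (has_encoded_surrogate is an invented name).
Python's "surrogatepass" error handler is used only to produce the
bogus bytes for the demonstration.

```python
def has_encoded_surrogate(data):
    """True if data contains a 3-byte encoding of U+D800..U+DFFF.

    0xED is never a continuation byte, so 0xED followed by 0xA0..0xBF
    can only be the start of an encoded surrogate.
    """
    for i in range(len(data) - 1):
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            return True
    return False


# One character, two competing byte sequences: this is the aliasing.
proper = "\U00010000".encode("utf-8")
bogus = "\ud800\udc00".encode("utf-8", "surrogatepass")

print(proper.hex())  # f0908080
print(bogus.hex())   # eda080edb080
print(has_encoded_surrogate(proper), has_encoded_surrogate(bogus))  # False True
```

A strict decoder would reject the second form outright rather than
quietly map it to the same character as the first.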

Doing UTF-8 correctly has never been ambiguous.

        -hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/