Followup to: <[EMAIL PROTECTED]>
By author: Roozbeh Pournader <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
>
> On 6 Jun 2001, H. Peter Anvin wrote:
>
> > There is only one UTF-8.
>
> But which is that? The one described in RFC 2279, the one in ISO
> 10646-1:2000, or the one in Unicode 3.1? These are different.
>
The only difference is how permissive each standard is about irregular
sequences. No standard has ever required that irregular sequences be
interpreted (except perhaps as a specification bug), and the only safe
answer has always been to reject them.
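
To make "reject them" concrete, here is a minimal sketch of a strict
decoder for a single sequence, assuming the RFC 2279 layout (1 to 6
bytes). The names decode_one, sequence_length and MIN_FOR_LENGTH are
made up for this illustration and come from no standard or library.

```python
# Smallest value that legitimately needs a sequence of each length
# (RFC 2279 layout, 1 to 6 bytes). Hypothetical helper table.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}

def sequence_length(lead):
    """Return the sequence length announced by a lead byte, or raise."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    if 0xF8 <= lead < 0xFC:
        return 5
    if 0xFC <= lead < 0xFE:
        return 6
    # 0x80..0xBF are continuation bytes; 0xFE and 0xFF never occur.
    raise ValueError("invalid lead byte 0x%02X" % lead)

def decode_one(seq):
    """Decode exactly one UTF-8 sequence to a UCS-4 value, strictly."""
    if not seq:
        raise ValueError("empty input")
    n = sequence_length(seq[0])
    if len(seq) != n:
        raise ValueError("expected %d bytes, got %d" % (n, len(seq)))
    if n == 1:
        return seq[0]
    value = seq[0] & (0x7F >> n)          # payload bits of the lead byte
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte 0x%02X" % b)
        value = (value << 6) | (b & 0x3F)
    if value < MIN_FOR_LENGTH[n]:
        # Non-shortest form: the same value fits in fewer bytes.
        raise ValueError("overlong encoding of 0x%X" % value)
    return value
```

With this sketch, decode_one(b"/") returns 0x2F, while the two-byte
form b"\xc0\xaf" of the same character raises ValueError instead of
quietly decoding to "/".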

The recent "we have this UTF-16 thing and let's break everything else
to cater to it" trend really is irrelevant here. It doesn't affect
UCS-4 <-> UTF-8 conversion at all, since there is no aliasing: every
UCS-4 value maps to exactly one UTF-8 sequence. It's pretty much up to
the application whether it treats a UCS-4 value of 0x60000000 [for
example] as an error sentinel or as an undefined character -- the
effect is the same either way, because no invalid character is ever
aliased to a valid one.
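
As an illustration of the no-aliasing point, the following sketch
round-trips UCS-4 values through UTF-8 using the 31-bit layout of
RFC 2279. Python's built-in codec stops at U+10FFFF, so the bit
packing is done by hand; ucs4_to_utf8, utf8_to_ucs4 and RANGES are
hypothetical names chosen for this example only.

```python
# (exclusive upper bound, sequence length) for the RFC 2279 31-bit layout.
RANGES = ((0x80, 1), (0x800, 2), (0x10000, 3), (0x200000, 4),
          (0x4000000, 5), (0x80000000, 6))

def ucs4_to_utf8(value):
    """Encode one UCS-4 value (0 .. 0x7FFFFFFF) as its shortest sequence."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("not a UCS-4 value: %r" % (value,))
    n = next(length for bound, length in RANGES if value < bound)
    if n == 1:
        return bytes([value])
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (value & 0x3F))   # low 6 bits per continuation
        value >>= 6
    lead = ((0xFF00 >> n) & 0xFF) | value    # n leading 1 bits, then payload
    return bytes([lead] + tail[::-1])

def utf8_to_ucs4(seq):
    """Decode exactly one UTF-8 sequence; reject anything irregular."""
    seq = bytes(seq)
    if not seq:
        raise ValueError("empty input")
    lead = seq[0]
    n = 0
    while n < 8 and lead & (0x80 >> n):      # count leading 1 bits
        n += 1
    if n == 0:
        n = 1                                # plain ASCII byte
    elif n == 1 or n > 6:
        raise ValueError("invalid lead byte 0x%02X" % lead)
    if len(seq) != n:
        raise ValueError("wrong sequence length")
    value = lead & (0x7F >> n) if n > 1 else lead
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (b & 0x3F)
    # One value, one sequence: any other spelling is an overlong alias.
    if ucs4_to_utf8(value) != seq:
        raise ValueError("overlong sequence")
    return value
```

Here 0x60000000 encodes to the six bytes FD A0 80 80 80 80 and decodes
back to 0x60000000. What that value means (error sentinel, undefined
character) is left entirely to the caller; the conversion itself never
maps it onto some other, valid character.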

The one thing that should *NOT* be done, and which the standards will
not tell you because of politics (read: software vendors would have to
upgrade their buggy software), is UTF-16-in-UTF-8 conversion, i.e.
encoding each UTF-16 surrogate as if it were a character of its own --
such code points are bogus. Period. The assertion that "it's not in
the BMP, it's not important" is broken -- a lot of issues occur
because of aliasing in general, not because of any particular
sequencing.
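
A small sketch of the aliasing problem, using Python's built-in codec:
the function bogus_utf16_in_utf8 (a hypothetical name) deliberately
builds the form that should not exist, via the "surrogatepass" error
handler, so that it can be compared with the real encoding. The strict
default decoder is used to show the rejection.

```python
def bogus_utf16_in_utf8(cp):
    """Build the form that must NOT be produced: the two UTF-16
    surrogates of a non-BMP code point, each encoded as 3 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP have surrogates")
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)       # leading surrogate
    low = 0xDC00 | (offset & 0x3FF)      # trailing surrogate
    # "surrogatepass" is needed precisely because the strict codec
    # refuses to encode lone surrogates.
    return (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

def is_valid_utf8(data):
    """True if the strict built-in decoder accepts the byte string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

proper = chr(0x10000).encode("utf-8")    # four bytes: F0 90 80 80
bogus = bogus_utf16_in_utf8(0x10000)     # six bytes:  ED A0 80 ED B0 80
```

If both byte strings were accepted, the same character would have two
spellings, which is exactly the aliasing to avoid; here
is_valid_utf8(proper) is True and is_valid_utf8(bogus) is False.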

Doing UTF-8 correctly has never been ambiguous.

        -hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/