H. Peter Anvin <[EMAIL PROTECTED]>:
> > But which is that? The one described in RFC 2279, the one in ISO
> > 10646-1:2000, or the one in Unicode 3.1? These are different.
> >
>
> The only difference is how permissive the standard is with respect to
> the handling of irregular sequences. No standard has ever required
> interpretation of irregular sequences (except perhaps as a
> specification bug), and the only safe answer has always been to reject
> them.
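A concrete illustration of the rejection H. Peter Anvin describes: the classic irregular form is an overlong sequence such as C0 80, which decodes to U+0000 but is not the shortest encoding of that value. The following is a hedged sketch of a strict single-sequence decoder (the function name and structure are mine, not from any particular library) that rejects such forms:

```python
def strict_utf8_decode_one(b):
    """Decode one UTF-8 sequence from the front of bytes b.

    Rejects overlong ("irregular") forms by checking that the decoded
    value actually needed as many bytes as were used.  Returns
    (codepoint, bytes_consumed).  Illustrative sketch only.
    """
    first = b[0]
    if first < 0x80:                      # 1-byte ASCII form
        return first, 1
    # (length, lead-byte range, payload mask, minimum legal value)
    for nbytes, lo, hi, mask, minval in (
            (2, 0xC0, 0xDF, 0x1F, 0x80),
            (3, 0xE0, 0xEF, 0x0F, 0x800),
            (4, 0xF0, 0xF7, 0x07, 0x10000),
            (5, 0xF8, 0xFB, 0x03, 0x200000),
            (6, 0xFC, 0xFD, 0x01, 0x4000000)):
        if lo <= first <= hi:
            if len(b) < nbytes:
                raise ValueError("truncated sequence")
            cp = first & mask
            for byte in b[1:nbytes]:
                if byte & 0xC0 != 0x80:
                    raise ValueError("bad continuation byte")
                cp = (cp << 6) | (byte & 0x3F)
            if cp < minval:
                raise ValueError("irregular (overlong) sequence")
            return cp, nbytes
    raise ValueError("invalid lead byte")
```

A strict decoder like this accepts E2 82 AC (U+20AC) but raises on C0 80, which is exactly the "only safe answer" behaviour.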
But sometimes it is not possible to reject sequences; you have to do
something with the data, even if that means replacing it by '?'s. So
in some circumstances it might be better to accept and generate UTF-8
sequences corresponding to all of the integers from 0 to 2^31-1. That
is, after all, the simplest and most logical behaviour, and it would be
the standard behaviour if there were no endian and UTF-16 problems.
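To make the "all integers from 0 to 2^31-1" point concrete, here is a sketch of an encoder following the original FSS-UTF / RFC 2279 scheme, which allows sequences up to 6 bytes (the function name is mine; later standards such as Unicode 3.1 restrict the range and forbid surrogates, which this sketch deliberately ignores):

```python
def utf8_encode_31bit(cp):
    """Encode any integer 0..2^31-1 as RFC 2279-style UTF-8.

    Uses the original rule: 1 byte below 0x80, then 2..6 byte
    sequences with lead-byte markers C0/E0/F0/F8/FC.  Sketch only.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("out of 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # (sequence length, exclusive upper bound, lead-byte marker)
    for nbytes, limit, marker in ((2, 0x800, 0xC0), (3, 0x10000, 0xE0),
                                  (4, 0x200000, 0xF0), (5, 0x4000000, 0xF8),
                                  (6, 0x80000000, 0xFC)):
        if cp < limit:
            out = []
            for _ in range(nbytes - 1):      # low 6 bits per trail byte
                out.append(0x80 | (cp & 0x3F))
                cp >>= 6
            out.append(marker | cp)          # remaining bits in lead byte
            return bytes(reversed(out))
```

Under this scheme U+D800 encodes happily to ED A0 80 and 2^31-1 to FD BF BF BF BF BF, which is exactly the permissive behaviour the paragraph above argues for.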
It sort of irritates me that in a UCS-4/UTF-8 world we are expected to
treat U+D800..U+DFFF and U+FFFE and U+FFFF as illegal.
Edmund
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/