> -----Original Message-----
> From: Markus Kuhn [mailto:[EMAIL PROTECTED]]
...
> > > > 2. Allowing 5 and 6-byte UTF-8, which Unicode 3.1 forbids.
> > >
> > > http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
> > >
> > > Neither of these is a deviation of ISO 10646, which has a somewhat
> > > broader scope than Unicode and is (at least in the context of
> > > communication with ISO 6429 terminals) the preferred reference.
I would not subscribe to the characterisation that 10646 has any broader
scope than Unicode at all.
> > Are you sure this isn't a deviation of ISO 10646? I thought they
> > removed the 5 and 6-byte UTF-8 sequences in the latest stuff.
>
> Not in ISO/IEC 10646-1:2000.
>
> The rumours about UTF-8 being restricted to 4 bytes are just a Fear,
> Uncertainty and Doubt strategy by the dark lords of the UTF-16 cult and
> their 16-bit Win32 religion.
That is just bullshit, Markus! The reason Unicode has a COMMON limit
of 10FFFF for all three of its encoding forms is for INTEROPERABILITY
reasons. And those are upheld by everyone in the UTC, AFAIK.
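
To make the interoperability point concrete, here is a rough sketch (Python,
helper names are my own and purely illustrative) comparing what the 6-byte
UTF-8 of 10646-1:2000 can address with what a UTF-16 surrogate pair can
reach. It assumes the usual per-length UTF-8 value ranges and the D800..DBFF
/ DC00..DFFF surrogate ranges; treat it as an illustration, not as text from
either standard.

```python
# Largest value encodable with 1..6 bytes under the original
# ISO/IEC 10646 UTF-8 definition (7, 11, 16, 21, 26 and 31 payload bits).
UTF8_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

def utf8_length(cp):
    """Bytes needed for code position cp in 6-byte-capable UTF-8."""
    if cp < 0:
        raise ValueError("negative code position")
    for nbytes, limit in enumerate(UTF8_LIMITS, start=1):
        if cp <= limit:
            return nbytes
    raise ValueError("beyond the 31-bit UCS")

def utf16_max():
    """Highest code position reachable with a UTF-16 surrogate pair."""
    hi, lo = 0xDBFF, 0xDFFF  # last high and last low surrogate
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# UTF-16 stops exactly at 10FFFF, which still fits in 4 UTF-8 bytes.
print(hex(utf16_max()), utf8_length(utf16_max()))
# Anything that needs 5 or 6 UTF-8 bytes has no UTF-16 form at all.
print(utf8_length(0x200000), utf8_length(0x7FFFFFFF))
```

Every code position that needs a 5- or 6-byte sequence lies above what
UTF-16 can express, so data using them cannot round-trip through all three
encoding forms.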
> The private use groups at the far end of the 31-bit UCS are perfectly
> good and useful in potential future schemes to guarantee say roundtrip
> compatibility to various encodings with up to 2^29 code positions (full
> ISO 2022/ISO IR, keysyms, etc.). There is not the slightest reason for
> POSIX implementors to not support the full 6-byte version of UTF-8 as
> defined in ISO 10646-1:2000.
There is not YET a formal change of UTF-8 in 10646. The long term goal
is still to both a) synchronise the Unicode and 10646 definitions of the
encoding forms and b) make 10646 also respect interoperability concerns.
As a step in that direction, Amendment 1 to 10646:2000 will a) remove all
the private use planes above 10FFFF and b) contain a strong indication
(as a note) that no characters will be allocated above 10FFFF. I.e. WG2
does NOT envision that those private use planes are of any use at all,
in particular not the ones you indicate (which would be quite destructive).
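
For implementors wondering what the difference looks like in a decoder,
here is a minimal sketch (Python again; the function name and the
allow_long flag are hypothetical, not from any library) of classifying a
UTF-8 lead byte either with the 10FFFF restriction or with the longer
forms still present in 10646-1:2000. It only inspects the lead byte and
does not validate continuation bytes or overlong forms beyond C0/C1.

```python
def sequence_length(lead, allow_long=False):
    """Length announced by a UTF-8 lead byte, or None if not acceptable.

    allow_long=False: restricted to code positions up to 10FFFF.
    allow_long=True:  also accept the 5- and 6-byte lead bytes.
    """
    if lead < 0x80:
        return 1                       # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2                       # C0 and C1 could only be overlong
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4                       # F4 is the last lead byte below 110000
    if not allow_long:
        return None                    # F5..FF rejected when limited to 10FFFF
    if 0xF5 <= lead <= 0xF7:
        return 4                       # 4 bytes, but above 10FFFF
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return None                        # FE and FF are never lead bytes

print(sequence_length(0xF8), sequence_length(0xF8, allow_long=True))
```

With the restriction in place, the lead bytes F5..FD simply become
invalid, which is the whole visible effect on a decoder.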
Kind regards
/kent k
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/