Karlsson Kent - keka wrote on 2001-05-12 07:32 UTC:
> > The private use groups at the far end of the 31-bit UCS are perfectly
> > good and useful in potential future schemes to guarantee say roundtrip
> > compatibility to various encodings with up to 2^29 code positions (full
> > ISO 2022/ISO IR, keysyms, etc.). There is not the slightest reason for
> > POSIX implementors to not support the full 6-byte version of UTF-8 as
> > defined in ISO 10646-1:2000.
>
> There is not YET a formal change of UTF-8 in 10646. The long term goal
> is still to both a) synchronise the Unicode and 10646 definitions of the
> encoding forms and b) make 10646 also respect interoperability concerns.
>
> As a step in that direction, Amendment 1 to 10646:2000 will a) remove all
> the private use planes above 10FFFF and b) contain a strong indication
> (as a note) that no characters will be allocated above 10FFFF. I.e. WG2
> does NOT envision that those private use planes are of any use at all,
> in particular not the ones you indicate (which would be quite destructive).

I am very happy with the added note that promises that no standard
characters will be added beyond U-10FFFF. However, I'd prefer it if, as a
consequence, the entire code space from U-110000 upwards were
declared to be reserved for private use outside the scope of UCS.

We now have in wchar_t a nice infrastructure for handling 31-bit
characters, and I do urge all implementors of UTF-8 encoders and
decoders to keep them fully 31-bit transparent. UTF-16 is pretty
irrelevant to the GNU/POSIX platform. The wc API was not designed to
handle double-double-byte characters such as surrogate pairs. Why should
Linux programmers destroy the potentially useful full 31-bit space, just
because of silly interoperability concerns by the UTF-16 crowd?

They are just applying flawed logic IMHO: private use characters are by
definition non-interoperable anyway, regardless of whether they can be
represented in Word doc files or not. They are still very useful for
numerous special purpose applications, and there are many good reasons
why UCS was designed to be a 31-bit and not a 21- or 24-bit codespace.

Keep those UTF-8 engines 31-bit transparent! Don't waste precious bits!
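
To make "31-bit transparent" concrete, here is a minimal sketch of an
encoder and decoder that accept the whole range U-00000000..U-7FFFFFFF
using the original 1- to 6-byte UTF-8 layout. The function names
encode_utf8_31 and decode_utf8_31 are made up for this illustration and do
not come from any library. The decoder rejects overlong forms but, as a
simplifying assumption, does not filter surrogate code points.

```python
def encode_utf8_31(cp: int) -> bytes:
    """Encode one code point in 0..0x7FFFFFFF as 1 to 6 UTF-8 bytes."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper bound, lead-byte marker, continuation byte count)
    for limit, lead, ncont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if cp < limit:
            out = [lead | (cp >> (6 * ncont))]
            for i in range(ncont - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: range was checked above")


# Smallest code point that legitimately needs N continuation bytes.
_MIN_FOR_NCONT = (0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000)


def decode_utf8_31(data: bytes) -> list:
    """Decode 1- to 6-byte UTF-8 sequences into a list of code points."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, ncont = b, 0
        elif b < 0xC0:
            raise ValueError("unexpected continuation byte")
        elif b < 0xE0:
            cp, ncont = b & 0x1F, 1
        elif b < 0xF0:
            cp, ncont = b & 0x0F, 2
        elif b < 0xF8:
            cp, ncont = b & 0x07, 3
        elif b < 0xFC:
            cp, ncont = b & 0x03, 4
        elif b < 0xFE:
            cp, ncont = b & 0x01, 5
        else:
            raise ValueError("0xFE and 0xFF never appear in UTF-8")
        if i + ncont >= len(data) + (0 if ncont else 1):
            raise ValueError("truncated sequence")
        for c in data[i + 1 : i + 1 + ncont]:
            if c & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        if cp < _MIN_FOR_NCONT[ncont]:
            raise ValueError("overlong encoding")
        out.append(cp)
        i += 1 + ncont
    return out
```

With this layout, U-0010FFFF still encodes as the familiar four bytes
F4 8F BF BF, while U-7FFFFFFF, the top of the 31-bit space, takes six bytes
starting with FD. A decoder restricted to the UTF-16 range would have to
reject everything from F4 90 80 80 upwards.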
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>