Now, let's not confuse matters more than necessary. There is no such
thing as "Unicode BMP"; Unicode covers planes 0-16 (dec.), nothing less.
BMP is a 10646 concept, though that term is also used by Unicode.
UCS-2 is no longer (!) an encoding form of Unicode, though it is
still one of 10646. In time, I hope it will also disappear from
10646. There are unfortunately two (or three, depending on how you
want to count) varieties of UTF-8. *In practice*, though, there is
no difference, since 1) all of the private use planes above plane 16
(dec.) have been removed from 10646, and 2) WG 2 has committed itself
to never standardise any encoding of a character above plane 14 (decimal;
Plane 0E, in 10646-speak). In time, I hope the formal definitions of
UTF-8 will converge.
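
The "no difference in practice" point can be checked mechanically. Below
is a minimal sketch in Python, assuming the older 31-bit definition of
UTF-8 (sequences of up to six bytes, code points up to 7FFFFFFF). The
function name utf8_encode_31bit is made up for this illustration and is
not part of any library. It is compared against Python's built-in codec,
which follows the Unicode definition (0000-10FFFF only).

```python
def utf8_encode_31bit(cp: int) -> bytes:
    """Encode one code point using the older 31-bit UTF-8 scheme.

    Hypothetical helper for illustration only. It does not reject
    surrogate code points (D800-DFFF), which a real encoder should.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, sequence length, lead-byte marker)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break
    # Peel off six bits at a time, lowest first, for the trail bytes.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # What is left of cp fits in the lead byte.
    return bytes([lead | cp] + tail[::-1])


# Within 0000-10FFFF both definitions yield the same bytes.
for sample in (0x41, 0xE5, 0x20AC, 0x10000, 0x10FFFF):
    assert utf8_encode_31bit(sample) == chr(sample).encode("utf-8")
```

The two definitions only part ways above 10FFFF: the sketch still
produces bytes for 110000, whereas chr(0x110000) raises ValueError in
Python.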

In summary, then, there are three encoding forms (UTF-8, UTF-16, UTF-32)
that (in practice for 10646, also formally for Unicode) cover *exactly
the same* encoding space: 0000-10FFFF (Planes 00-10, hex.). Other encoding
forms, like SCSU, are separate; i.e. they are not Unicode (or 10646)
encoding forms, even though they may be able to encode any Unicode/10646
string. And UCS-2 should begin to be regarded as historic by now, though
it still lingers in older systems and 10646...
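
To make the summary concrete, here is a small Python sketch that
round-trips a few code points through the three encoding forms and shows
where UCS-2 falls short. The helper names (round_trips, fits_in_ucs2) are
invented for this example, and fits_in_ucs2 is only a rough
approximation of UCS-2's limit, namely a single 16-bit code unit.

```python
ENCODING_FORMS = ("utf-8", "utf-16-be", "utf-32-be")


def round_trips(cp: int) -> bool:
    """True if the code point survives encode/decode in all three forms."""
    ch = chr(cp)  # raises ValueError for anything above 0x10FFFF
    return all(ch.encode(form).decode(form) == ch for form in ENCODING_FORMS)


def fits_in_ucs2(cp: int) -> bool:
    """Assumption: UCS-2 is one 16-bit unit per character, so plane 0 only."""
    return cp <= 0xFFFF


# Code points from plane 0 up to the very end of plane 16 (surrogates avoided).
samples = (0x0000, 0x0041, 0xFFFD, 0x10000, 0x10FFFF)
assert all(round_trips(cp) for cp in samples)

# UTF-16 reaches beyond plane 0 by using a surrogate pair.
assert chr(0x10000).encode("utf-16-be") == b"\xd8\x00\xdc\x00"

# UCS-2 has no such mechanism.
assert fits_in_ucs2(0xFFFD) and not fits_in_ucs2(0x10000)
```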

I'm not going to redraw the figure for you; I leave that to you.

/Kent Karlsson

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of
> [EMAIL PROTECTED]
> Sent: den 18 december 2002 17:52
> To: [EMAIL PROTECTED]
> Subject: character encoding diagram
>
>
> Just for fun, I've made the enclosed diagram (using a UML tool) that
> shows the relations of major character sets and encodings.
> A solid line means a compatible extension, a dashed line means an
> implementation (here in the sense that an encoding implements a
> character set).
> Perhaps such a diagram could be useful in a Howto or tutorial.
>
> Kind regards,
> Thomas Wolff
>
>
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/