RE: character encoding diagram

Kent Karlsson Thu, 19 Dec 2002 04:26:57 -0800

Now, let's not confuse matters more than necessary.  There is no such
thing as "Unicode BMP"; Unicode covers planes 0-16 (dec.), nothing less.
BMP is a 10646 concept, though that term is also used by Unicode.
UCS-2 is no longer (!) an encoding form of Unicode, though it is
still one of 10646.  In time, I hope it will disappear also from
10646.  There are unfortunately two (or three, depending on how you
want to count) varieties of UTF-8.  *In practice*, though, there is
no difference, since 1) all of the private use planes above plane 16
(dec.) have been removed from 10646, and 2) WG 2 has committed itself
to never standardise any encoding of a character above plane 14 (decimal;
Plane 0E, in 10646-speak).  In time, I hope the formal definitions of
UTF-8 will converge.
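
To make the "in practice, no difference" point concrete, here is a minimal
Python sketch. It assumes (the message does not spell this out) that the
variety in question is the older UTF-8 definition with sequences of up to
6 bytes covering 0-7FFFFFFF, as opposed to the Unicode definition, which
stops at 4 bytes and 10FFFF. The helper name utf8_encode_long is
hypothetical and exists only for this illustration.

```python
def utf8_encode_long(cp: int) -> bytes:
    """Encode cp with the older, up-to-6-byte UTF-8 bit layout.

    Hypothetical helper for illustration only. It covers 0..0x7FFFFFFF
    and, to keep the sketch short, does not reject surrogate code points.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit code space")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, sequence length, lead-byte marker bits)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break
    out = []
    # Emit trailing bytes from the low end, 6 payload bits at a time.
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    out.append(lead | cp)
    return bytes(reversed(out))


# Up to 10FFFF the long form and Python's built-in encoder agree.
for cp in (0x41, 0x20AC, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_encode_long(cp) == chr(cp).encode("utf-8")

# Above 10FFFF only the long form has an answer; Python's decoder,
# which follows the Unicode definition, rejects such a sequence.
beyond = utf8_encode_long(0x110000)
try:
    beyond.decode("utf-8")
except UnicodeDecodeError:
    print("rejected:", beyond.hex())
```

Since nothing is (or will be) standardised above 10FFFF, the extra 5- and
6-byte forms never occur in conforming data, which is why the formal
difference has no practical effect.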


In summary, then, there are three encoding forms (UTF-8, UTF-16, UTF-32)
that (in practice for 10646, also formally for Unicode) cover *exactly
the same* encoding space: 0000-10FFFF (Planes 00-10, hex.).  Other encoding
forms, like SCSU, are separate; i.e. they are not Unicode (or 10646)
encoding forms, even though they may be able to encode any Unicode/10646
string. And UCS-2 should begin to be regarded as historic by now, though
it still lingers in older systems and 10646...
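
A quick way to check the "exactly the same encoding space" claim is to
round-trip the boundary code points through all three forms. The sketch
below uses Python's standard codecs; the function name encoded_lengths is
just an illustrative choice, and the big-endian codec names are picked so
that no byte order mark is added to the output.

```python
FORMS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_lengths(cp: int) -> dict:
    """Return the encoded length in bytes of code point cp in each form."""
    ch = chr(cp)
    lengths = {}
    for form in FORMS:
        data = ch.encode(form)
        # Each form must decode back to the same code point.
        assert data.decode(form) == ch
        lengths[form] = len(data)
    return lengths


# First and last code points of the BMP range and of the full range.
for cp in (0x0000, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}", encoded_lengths(cp))

# One past the end of the encoding space is not representable at all.
try:
    chr(0x110000)
except ValueError:
    print("U+110000 is outside the encoding space")
```

From U+10000 upwards UTF-16 needs 4 bytes, i.e. two 16-bit code units
(a surrogate pair). That is exactly what a fixed-width 16-bit form like
UCS-2 cannot express, and why UCS-2 stops at the BMP.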

I'm not going to redraw the figure for you; I leave that to you.

                /Kent Karlsson

        
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of
> [EMAIL PROTECTED]
> Sent: den 18 december 2002 17:52
> To: [EMAIL PROTECTED]
> Subject: character encoding diagram
> 
> 
> Just for fun, I've made the enclosed diagram (using a UML tool) that 
> shows the relations of major character sets and encodings.
> A solid line means a compatible extension, a dashed line means an 
> implementation (here in the sense that an encoding implements a 
> character set).
> Perhaps such a diagram could be useful in a Howto or tutorial.
> 
> Kind regards,
> Thomas Wolff
> 
> 
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
