Re: character encoding diagram

Thomas Wolff Thu, 19 Dec 2002 05:28:17 -0800

> From: Kent Karlsson <[EMAIL PROTECTED]>
> Now, let's not confuse matters more than necessary.  There is no such
> thing as "Unicode BMP", Unicode covers planes 0-16 (dec.), nothing less.
OK, I adapted the diagram, thank you. I also fixed one of the arrows.


> BMP is a 10646 concept, though that term is also used by Unicode.
> UCS-2 is (no longer!) an encoding form of Unicode, though it is
> still one of 10646.  In time, I hope it will disappear also from
> 10646.  There are unfortunately two (or three, depending on how you
> want to count) varieties of UTF-8.  *In practice*, though, there is
If you mean the different lengths (1/2/3/4/6 bytes), I think that's an
implementation (or even usage) detail beyond the scope of the diagram.
If you mean variations like Java-UTF-8 (with an extra sequence for NUL
to avoid the 0 byte), that's probably not worth mentioning; besides, it's
not ASCII-compatible. Or should horrible Oracle-UTF-8, with two 3-byte
sequences for UTF-16 surrogate pairs, be mentioned anywhere? Better not.
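
As a rough illustration of the variants mentioned above, here is a minimal
Python sketch. The function names are hypothetical helpers written for this
example, not library APIs, and the two variant encoders are simplified
approximations of the behaviour described (a two-byte sequence for NUL, and
surrogate pairs encoded as two 3-byte sequences), not the actual Java or
Oracle implementations.

```python
# Sketch only: all three helpers are hypothetical, written for illustration.

def standard_utf8(text: str) -> bytes:
    """Plain UTF-8 as produced by Python's built-in codec."""
    return text.encode("utf-8")


def nul_avoiding_utf8(text: str) -> bytes:
    """Assumed Java-style variant: U+0000 becomes C0 80, so no 0 byte appears.

    In UTF-8 the byte 0x00 only ever encodes U+0000, so a byte-level
    replacement is sufficient for this illustration.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def surrogate_pair_utf8(text: str) -> bytes:
    """Assumed Oracle-style variant: each UTF-16 code unit is encoded
    separately, so a non-BMP character becomes two 3-byte sequences."""
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    # "surrogatepass" lets the codec emit bytes for lone surrogate values.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
print(standard_utf8(clef).hex())        # f09d849e (one 4-byte sequence)
print(surrogate_pair_utf8(clef).hex())  # eda0b4edb49e (two 3-byte sequences)
print(nul_avoiding_utf8("a\x00").hex()) # 61c080
```

Note that a strict UTF-8 decoder rejects the output of both variant encoders,
which is one practical reason to keep them out of the diagram.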

> no difference, since 1) all of the private use planes above plane 16
> (dec.) have been removed from 10646, and 2) WG 2 has committed itself
> to never standardise any encoding of a character above plane 14
> (decimal; Plane 0E, in 10646-speak).  In time, I hope the formal
> definitions of UTF-8 will converge.
> 
> In summary, then, there are three encoding forms (UTF-8, UTF-16, UTF-32)
> that (in practice for 10646, also formally for Unicode) cover *exactly
> the same* encoding space: 0000-10FFFF (Planes 00-10, hex.).  Other
> encoding forms, like SCSU, are separate; i.e. they are not Unicode (or
> 10646) encoding forms, even though they may be able to encode any
> Unicode/10646 string. And UCS-2 should begin to be regarded as historic
> by now, though it still lingers in older systems and 10646...
BMP is not that historic if you consider font support of current systems.
Non-BMP font display in the X Window System is still experimental and needs
sophisticated extra configuration. TrueType also has some problems,
as recent mails show, and I don't know how far along Type 1 support is.
In any case, the diagram is meant as an aid for people to understand 
the relations of the most relevant of these terms in current discussions.
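
To make the BMP versus full-range distinction concrete, here is a small
Python sketch that encodes one code point in the three encoding forms named
in the quoted summary (UTF-8, UTF-16, UTF-32). The helper names are
hypothetical and exist only for this example; the big-endian codecs are
chosen simply so that no byte order mark is added.

```python
# Sketch only: helper names are hypothetical, chosen for this example.

MAX_CODE_POINT = 0x10FFFF  # top of plane 16 (decimal), i.e. plane 10 hex


def in_bmp(ch: str) -> bool:
    """True if the character lies in plane 0 (the BMP)."""
    return ord(ch) <= 0xFFFF


def encoding_forms(ch: str) -> dict:
    """Encode a single character in UTF-8, UTF-16 and UTF-32."""
    return {
        "UTF-8": ch.encode("utf-8"),
        "UTF-16": ch.encode("utf-16-be"),
        "UTF-32": ch.encode("utf-32-be"),
    }


clef = "\U0001D11E"  # a plane 1 character, so not representable in UCS-2
for name, data in encoding_forms(clef).items():
    print(name, data.hex())
# UTF-8 f09d849e
# UTF-16 d834dd1e
# UTF-32 0001d11e

print(in_bmp(clef), in_bmp("A"))  # False True
```

All three forms round-trip the same code point; only the surrogate pair in
the UTF-16 output shows that the character lies outside the BMP.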

> I'm not going to redraw the figure for you, I leave that to you.
Yes, and any further comments are welcome.

Kind regards,
Thomas Wolff


> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> > Subject: character encoding diagram
> > 
> > Just for fun, I've made the enclosed diagram (using a UML tool) that 
> > shows the relations of major character sets and encodings.
> > A solid line means a compatible extension, a dashed line means an 
> > implementation (here in the sense that an encoding implements a 
> > character set).

<<inline: character-encoding.gif>>
