> From: Kent Karlsson <[EMAIL PROTECTED]>
>
> Now, let's not confuse matters more than necessary. There is no such
> thing as "Unicode BMP", Unicode covers planes 0-16 (dec.), nothing less.

OK, I adapted the diagram, thank you. I also fixed one of the arrows.

> BMP is a 10646 concept, though that term is also used by Unicode.
> UCS-2 is (no longer!) an encoding form of Unicode, though it is
> still one of 10646. In time, I hope it will disappear also from
> 10646. There are unfortunately two (or three, depending on how you
> want to count) varieties of UTF-8. *In practice*, though, there is

If you mean the different lengths (1/2/3/4/6 bytes), I think that's an
implementation (or even usage) detail beyond the scope of the diagram.
If you mean variations like Java-UTF-8 (with an extra sequence for NUL
to avoid the 0 byte), that's probably not worth mentioning; also, it's
not ASCII-compatible. Or should horrible Oracle-UTF-8 be mentioned
anywhere, with two 3-byte sequences for UTF-16 surrogate pairs?
Better not.

> no difference, since 1) all of the private use planes above plane 16
> (dec.) have been removed from 10646, and 2) WG 2 has committed itself
> to never standardise any encoding of a character above plane 14
> (decimal; Plane 0E, in 10646-speak). In time, I hope the formal
> definitions of UTF-8 will converge.
>
> In summary, then, there are three encoding forms (UTF-8, UTF-16, UTF-32)
> that (in practice for 10646, also formally for Unicode) cover *exactly
> the same* encoding space: 0000-10FFFF (Planes 00-10, hex.). Other
> encoding forms, like SCSU, are separate; i.e. they are not Unicode (or
> 10646) encoding forms, even though they may be able to encode any
> Unicode/10646 string. And UCS-2 should begin to be regarded as historic
> by now, though it still lingers in older systems and 10646...

BMP is not that historic if you consider font support of current
systems. X-windows non-BMP font display is still experimental and needs
sophisticated extra configuration. TrueType also bears some problems, as
recent mails show, and I don't know how far Type 1 is.
In any case, the diagram is meant as an aid for people to understand
the relations of the most relevant of these terms in current
discussions.

> I'm not going to redraw the figure for you, I leave that to you.

Yes, and any further comments are welcome.

Kind regards,
Thomas Wolff

> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> > Subject: character encoding diagram
> >
> > Just for fun, I've made the enclosed diagram (using a UML tool) that
> > shows the relations of major character sets and encodings.
> > A solid line means a compatible extension, a dashed line means an
> > implementation (here in the sense that an encoding implements a
> > character set).

<<inline: character-encoding.gif>>
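
As a rough illustration of two points from the thread (that UTF-8, UTF-16
and UTF-32 encode the same code points, and that the surrogate-pair
variant of UTF-8 mentioned above differs from standard UTF-8 outside the
BMP), here is a minimal Python sketch. The sample code point U+10400 and
the helper name surrogate_pair_utf8 are arbitrary choices for this
example, not anything taken from the diagram. The variant it imitates is,
as far as I know, the one described elsewhere under the name CESU-8.

```python
# Illustrative sketch only: sample character and helper name are assumptions.
ch = chr(0x10400)  # a code point in plane 1, i.e. outside the BMP

# The three encoding forms represent the same code point differently.
utf8 = ch.encode("utf-8")       # one 4-byte sequence
utf16 = ch.encode("utf-16-be")  # one surrogate pair (two 16-bit units)
utf32 = ch.encode("utf-32-be")  # one 32-bit unit


def surrogate_pair_utf8(text):
    """Encode each UTF-16 code unit of `text` separately as UTF-8.

    For BMP characters this matches standard UTF-8; for anything above
    the BMP it yields two 3-byte sequences, one per surrogate.
    """
    units = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # "surrogatepass" lets Python emit the otherwise invalid
        # UTF-8 encoding of a lone surrogate code point.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


variant = surrogate_pair_utf8(ch)

print(utf8.hex(" "))     # f0 90 90 80
print(utf16.hex(" "))    # d8 01 dc 00
print(utf32.hex(" "))    # 00 01 04 00
print(variant.hex(" "))  # ed a0 81 ed b0 80
```

All three standard forms decode back to the same single character, while
a strict UTF-8 decoder rejects the six-byte variant, which is one reason
to keep it out of a diagram of the standard encodings.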
