RE: character encoding diagram

Kent Karlsson Thu, 19 Dec 2002 06:15:13 -0800

...
> > 10646.  There are unfortunately two (or three, depending on how you
> > want to count) varieties of UTF-8.  *In practice*, though, there is


> If you mean the different lengths (1/2/3/4/6 bytes),

I mean the Unicode UTF-8, the 10646 UTF-8 and the IETF UTF-8.

> If you mean variations like Java-UTF-8 (with an extra 
> sequence for NUL 
> to avoid the 0 byte), 

That is NOT UTF-8, despite the name.

> that's probably not worth mentioning, also it's 
> not ASCII-compatible. Or should horrible Oracle-UTF-8 be mentioned 

No. (Then we're up to four... But theirs is not called UTF-8.)
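For readers who have not met these variants, here is a minimal Python
sketch of why a strict UTF-8 decoder does not accept them.  It assumes
that the Java variant writes U+0000 as the two bytes C0 80, and that the
Oracle variant is the scheme that encodes each UTF-16 surrogate as its own
three-byte sequence (later described as CESU-8).  The helper names are
made up for illustration.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def encode_surrogates_separately(text: str) -> bytes:
    """Hypothetical helper: encode every UTF-16 code unit on its own.

    Assumption: this mirrors the Oracle-style scheme, where a non-BMP
    character becomes two 3-byte sequences instead of one 4-byte one.
    """
    raw = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], "big")
        # "surrogatepass" lets a lone surrogate through the encoder.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# Assumption: the Java-style encoding of U+0000 (an overlong form).
JAVA_STYLE_NUL = b"\xc0\x80"

if __name__ == "__main__":
    print(is_strict_utf8("\x00".encode("utf-8")))   # true UTF-8 NUL
    print(is_strict_utf8(JAVA_STYLE_NUL))           # Java-style NUL
    print("\U00010000".encode("utf-8").hex())       # true UTF-8
    print(encode_surrogates_separately("\U00010000").hex())
```

For ASCII and BMP text the second helper produces the same bytes as true
UTF-8; the two only diverge beyond the BMP.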



> BMP is not that historic if you consider font support of 
> current systems.

BMP is not historic.  But UCS-2 is, as seen from Unicode and 10646
(almost), and from Windows XP (almost), MacOS X, and Oracle (almost).
But yes, font support beyond the BMP is not good.  That is to be expected:
*most*, not all, of the characters there are of interest only to very few.
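The BMP versus UCS-2 distinction can be sketched in a few lines of Python.
A BMP character fits in one 16-bit code unit, which is all UCS-2 can
express; anything beyond the BMP needs a UTF-16 surrogate pair.  The
function names are hypothetical, chosen only for this illustration.

```python
def in_bmp(ch: str) -> bool:
    """True if the single character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for one character."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    for ch in ("A", "\u4e2d", "\U0001D11E"):
        units = [f"{u:04X}" for u in utf16_units(ch)]
        print(f"U+{ord(ch):04X} in BMP: {in_bmp(ch)}, UTF-16 units: {units}")
```

A fixed-width 16-bit representation has no way to store the last of these
characters, which is why UCS-2 counts as historic while the BMP does not.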

> X-windows non-BMP font display is still experimental and needs 
> sophisticated extra configuration.

Not very sophisticated... But it is unfortunate that it is not default.

> Truetype also bears some problems as recent mails show, 

Not that I know of.  Not AAT and OpenType anyway (which are both
successors of TrueType).

> and I don't know how far Type 1 is.

> In any case, the diagram is meant as an aid for people to understand 
> the relations of the most relevant of these terms in current 
> discussions.

Then have one encoding space for ASCII, one for Latin-1, and *ONLY ONE*
encoding space for Unicode/10646, since that is what is of practical
interest.  Skip UCS-2, and skip the variations of the (true) UTF-8's (the
latter you have almost done).  By all means, mention the BMP.  BMP still
has, and will likely retain, a priority position for both Unicode and 10646.
But not UCS-2, or mention it very parenthetically.  It was dropped from
Unicode long ago, and is just a lingering wart on 10646 (and some
systems).

(When I read your first e-mail, before opening the image, I expected
at least a couple of dozen encodings to be covered...  Several of them
are extensions of extensions; e.g. GB 18030 is an extension of GB 2312,
which is an extension of ASCII.)
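The extension chain in that last example can be checked with Python's
standard codecs.  This is only a sketch: it assumes the "gb2312" codec
(the EUC-CN form of GB 2312) and the "gb18030" codec that ship with
CPython, and the helper name is made up.

```python
def same_bytes(text: str, *codecs: str) -> bool:
    """True if every listed codec encodes the text to identical bytes."""
    encoded = {text.encode(codec) for codec in codecs}
    return len(encoded) == 1


def encodable(text: str, codec: str) -> bool:
    """True if the codec has a representation for the text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # ASCII text is unchanged in both GB encodings.
    print(same_bytes("ASCII", "ascii", "gb2312", "gb18030"))
    # A GB 2312 character keeps its two-byte code in GB 18030.
    print(same_bytes("\u4e2d", "gb2312", "gb18030"))
    # A non-BMP character exists only in GB 18030 (four bytes).
    print(encodable("\U00010000", "gb2312"), encodable("\U00010000", "gb18030"))
```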

                        /Kent K
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
