Markus Kuhn <Markus dot Kuhn at cl dot cam dot ac dot uk> wrote:

> UTF-16 remains an ugly miscarriage, because by placing the surrogates
> not at the end of the 16-bit space but into the middle of the code
> range, it leads to an incompatible binary sorting order in B-trees
> with UCS-4 and UTF-8 and therefore is useless for database
> applications that want to hide the internal encoding from the user of
> B-tree iterators.
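
(To make the sorting point concrete, here is a rough sketch in C. The
two characters are arbitrarily chosen; any BMP character above the
surrogate block paired with any supplementary-plane character shows
the same effect, and 16/32-bit widths for short and int are assumed.)

/* U+FF5E  FULLWIDTH TILDE           UTF-16: FF5E       UTF-8: EF BD 9E
 * U+10000 LINEAR B SYLLABLE B008 A  UTF-16: D800 DC00  UTF-8: F0 90 80 80
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* one-character "strings" in three encoding forms */
    unsigned short u16_ff5e[]  = { 0xFF5E };
    unsigned short u16_10000[] = { 0xD800, 0xDC00 };   /* surrogate pair */
    unsigned int   u32_ff5e[]  = { 0xFF5E };
    unsigned int   u32_10000[] = { 0x10000 };
    unsigned char  u8_ff5e[]   = { 0xEF, 0xBD, 0x9E };
    unsigned char  u8_10000[]  = { 0xF0, 0x90, 0x80, 0x80 };

    /* UTF-16 code-unit order: 0xD800 < 0xFF5E, supplementary char sorts low */
    printf("UTF-16: %s\n",
           u16_10000[0] < u16_ff5e[0] ? "U+10000 < U+FF5E" : "U+10000 > U+FF5E");

    /* UCS-4 order: 0x10000 > 0xFF5E, supplementary char sorts high */
    printf("UCS-4 : %s\n",
           u32_10000[0] < u32_ff5e[0] ? "U+10000 < U+FF5E" : "U+10000 > U+FF5E");

    /* UTF-8 byte order agrees with UCS-4: 0xF0 > 0xEF */
    printf("UTF-8 : %s\n",
           memcmp(u8_10000, u8_ff5e, 3) < 0 ? "U+10000 < U+FF5E" : "U+10000 > U+FF5E");

    return 0;
}

Under 16-bit code-unit comparison the surrogate pair sorts below every
BMP character from U+E000 up; under UCS-4 or UTF-8 byte comparison the
same character sorts above all of them, which is the B-tree
incompatibility being described.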

I wasn't there, but I'm sure the creators of UTF-16 would have loved to
put the surrogates at the end of the 16-bit code space, if only
characters hadn't already been assigned there that were probably in
much greater use from the outset than the Korean syllables (which were
subsequently moved, resulting in a lot of criticism of Unicode for its
"instability"). Of course, putting the surrogates at the end of the
code space would have meant the logic surrounding U+FEFF (BOM) vs.
U+FFFE (unassignable code point, used for endian checking) would have
had to be re-thought somewhat.

If UTF-16 had been designed into the architecture at the beginning,
many of these historical decisions could have been made differently.
But the original vision for Unicode, at least for some, was to encode
only the most commonly used characters (not every Han character ever
listed in a dictionary, not all 11,172 modern Hangul syllables, not
hundreds of Arabic contextual forms) and leave lesser-used characters
to the Private Use Area. Reducing the scope in this way would have
made the original vision of fitting everything into 16 bits much more
realistic.

> It appears that Miller deserves credit for recognizing that UTF-1 was
> of no use whatsoever,

That seems an overstatement. It's true that UTF-1 didn't protect the
ASCII slash and other similarly important characters from appearing in
multi-byte representations, which rendered it useless for file names.
It was also slower in implementation than UTF-8, because it used
integer division instead of bit shifting (no word on whether this ever
made a practical difference, though).

But as an encoding to be used *within* files (not in file names),
UTF-1 had some advantages over UTF-8 that we used to hear quite a bit
about on this list, oh, maybe five years ago: Latin-1 legibility and
non-use of C1 control characters.

Latin-1 characters were encoded in UTF-1 by prepending 0xA0 (NO-BREAK
SPACE), which made them fairly readable when rendered by a
Unicode-ignorant display engine. UTF-8 does something similar
(prepending Â) for Latin-1 symbols below 0xC0, but the Latin letters
starting at 0xC0 are not readable at all. This may not seem like a big
deal now, but back in the mid-'90s it was a HUGE problem for some
people.

Also, as we know, UTF-8 uses bytes in the C1 control range (0x80 to
0x9F) as continuation bytes in multi-byte sequences. People used to
complain mightily about how this broke terminal programs that
interpreted these bytes before the UTF-8 decoder had a chance to see
them, and performed control functions that might switch character sets
or even hang the terminal. Again, much software is now built to
understand UTF-8, but that wasn't the case just a few years ago. UTF-1
protected C1 bytes, but of course used printable ASCII instead (which
led to different problems).

I wouldn't recommend that we all drop everything and switch to UTF-1,
but it was not 100 percent evil.
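
For anyone who has never stared at the raw bytes, here is a small
sketch of what a Latin-1-only display or a C1-interpreting terminal
actually receives. It is only an illustration; the UTF-1 bytes follow
the prepend-0xA0 rule described above, so double-check against the
specs before quoting the values.

#include <stdio.h>

/* print a byte sequence, flagging anything in the C1 control range */
static void dump(const char *label, const unsigned char *p, int n)
{
    int i;
    printf("%-16s", label);
    for (i = 0; i < n; i++) {
        printf(" %02X", p[i]);
        if (p[i] >= 0x80 && p[i] <= 0x9F)
            printf("(C1!)");   /* a dumb terminal may act on this byte */
    }
    printf("\n");
}

int main(void)
{
    /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
    unsigned char utf1_eacute[] = { 0xA0, 0xE9 };       /* NBSP + 'é' on a Latin-1 screen */
    unsigned char utf8_eacute[] = { 0xC3, 0xA9 };       /* shows up as 'Ã©' */

    /* U+201C LEFT DOUBLE QUOTATION MARK */
    unsigned char utf8_ldquo[]  = { 0xE2, 0x80, 0x9C }; /* 0x80 and 0x9C are C1 controls */

    dump("UTF-1  U+00E9:", utf1_eacute, 2);
    dump("UTF-8  U+00E9:", utf8_eacute, 2);
    dump("UTF-8  U+201C:", utf8_ldquo, 3);
    return 0;
}

The A0 E9 pair comes out as a no-break space followed by a readable
'é', the C3 A9 pair comes out as 'Ã©', and the 0x80 and 0x9C bytes in
the quotation mark are exactly the kind a pre-UTF-8 terminal might act
on before any decoder sees them.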

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/