Re: Revision of UTF-8 history in draft-yergeau-rfc2279bis-05.txt

Markus Kuhn Fri, 13 Jun 2003 07:20:13 -0700

John Cowan wrote on 2003-06-12 21:15 UTC:
> Francois Yergeau scripsit:
> 
> > The other is the modification that, according to Markus' references, Ken
> > Thompson designed on a diner placemat and sent in an email dated Fri Sep  4
> > 03:37:39 EDT 1992.  
> 
> Is this email online anywhere?  Its historical value would be very great.


My little UTF-8 history on

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#history

has a link to

  http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

which contains excerpts of my recent correspondence with Rob Pike on
this issue. It includes the emails and original specs as retrieved a few
days ago from the Plan9 project archives.

If anyone can find a copy of the original FSS/UTF (Miller et al.)
proposal and early X/Open documents on the subject, I would be most
interested in having a look at them and adding them to the above
collection.

Note that the archives contain two snapshots of the original FSS/UTF
(Thompson) specification. The first one encoded 32-bit words, whereas
the later one dropped the use of the bytes 0xfe and 0xff and thereby
encoded only 31-bit words. Thompson's original spec even included a
note that overlong sequences are illegal!
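
As a rough illustration of those two points (a sketch only, not taken
from the archived specs), the following Python assumes the 1-to-6-byte
layout later published in RFC 2279 as a stand-in for the 31-bit
variant. The names LIMITS, PREFIXES, encode and decode are made up for
this example. It shows that the bytes 0xfe and 0xff never occur, and
how a decoder can reject overlong sequences.

```python
# Largest value each sequence length (1..6 bytes) can carry.
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# Lead-byte prefix per length: 0xxxxxxx, 110xxxxx, 1110xxxx, ...
PREFIXES = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode(cp, length=None):
    """Encode cp in the shortest form, or in a forced longer form."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit into 31 bits")
    shortest = next(i for i, lim in enumerate(LIMITS) if cp <= lim) + 1
    if length is None:
        length = shortest
    if not shortest <= length <= 6:
        raise ValueError("length cannot hold this value")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte 10xxxxxx
        cp >>= 6
    return bytes([PREFIXES[length - 1] | cp] + tail[::-1])


def decode(seq):
    """Decode one complete sequence; reject malformed or overlong input."""
    if not seq:
        raise ValueError("empty input")
    lead = seq[0]
    if lead < 0x80:
        length, cp = 1, lead
    else:
        for length in range(2, 7):
            mask = (0xFF << (7 - length)) & 0xFF  # top length+1 bits
            if (lead & mask) == PREFIXES[length - 1]:
                cp = lead & (0x7F >> length)
                break
        else:
            raise ValueError("invalid lead byte")  # 0x80..0xbf, 0xfe, 0xff
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if encode(cp) != bytes(seq):
        raise ValueError("overlong sequence")
    return cp
```

For instance, encode(0x7FFFFFFF) starts with the lead byte 0xfd, the
largest one this layout ever produces, and decode(encode(0x2F,
length=2)) raises ValueError because the two-byte form of "/" is
overlong.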

We can see here the evolution of a very simple and elegant original
design, which was then continuously ground down by the work of committees to
the odd interconnections with UTF-16 that we see today.

[Fortunately though, UTF-16 remains of little bother to anyone in the
Unix/Plan9 world, where UTF-16 and its 0x10ffff limit are virtually
unheard of, except for the occasional shaking of heads, and very likely
will remain so. The reduction from 0xffffffff to 0x7fffffff was
technically reasonable though, as it permits the use of signed 32-bit
integer types. UTF-16 remains an ugly miscarriage: because the
surrogates were placed not at the end of the 16-bit space but in the
middle of the code range, its binary sorting order in B-trees is
incompatible with that of UCS-4 and UTF-8. It is therefore useless for
database applications that want to hide the internal encoding from the
user of B-tree iterators.]
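
The sorting-order difference can be checked with a small Python
sketch. The two sample code points and the helper name byte_order are
chosen only for this illustration, and Python's "utf-32-be" codec
stands in for UCS-4.

```python
a = "\uFF61"      # BMP code point located above the surrogate block
b = "\U00010000"  # first code point that needs a surrogate pair


def byte_order(encoding):
    """Sort the two samples by the raw bytes of the given encoding."""
    return sorted([b, a], key=lambda s: s.encode(encoding))


# UTF-8 and UCS-4 byte order agree with code point order.
assert byte_order("utf-8") == [a, b]
assert byte_order("utf-32-be") == [a, b]

# UTF-16 encodes b as D800 DC00, which sorts before FF61.
assert byte_order("utf-16-be") == [b, a]
```

A B-tree keyed on raw UTF-16 bytes would therefore iterate over these
two keys in the opposite order from one keyed on UTF-8 or UCS-4.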

It appears that Miller deserves credit for recognizing that UTF-1 was of
no use whatsoever, that Thompson and Pike got everything right from the
very beginning, and that all the cruelties done to UTF-8 later didn't
contribute in the slightest to its ultimate usefulness and beauty.

Now, who invented UTF-16?

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
