John Cowan wrote on 2003-06-12 21:15 UTC:
> Francois Yergeau scripsit:
>
> > The other is the modification that, according to Markus' references, Ken
> > Thompson designed on a diner placemat and sent in an email dated Fri Sep 4
> > 03:37:39 EDT 1992.
>
> Is this email online anywhere?  Its historical value would be very great.
My little UTF-8 history on

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#history

has a link to

  http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

which contains excerpts of my recent correspondence with Rob Pike on
this issue. It includes the emails and original specs as retrieved a few
days ago from the Plan9 project archives.

If anyone can find a copy of the original FSS/UTF (Miller et al.)
proposal and early X/Open documents on the subject, I would be most
interested in having a look at them and adding them to the above
collection.

Note that the archives contain two snapshots of the original
FSS/UTF (Thompson) specification. The first one encoded 32-bit words,
whereas the later one dropped the use of the bytes 0xfe and 0xff and
thereby encoded only 31-bit words. Thompson's original spec even had a
note on overlong sequences being illegal!

We can see here the evolution of a very simple and elegant original
design, which was then continuously ground down by the work of
committees to the odd interconnections with UTF-16 that we see today.

[Fortunately though, UTF-16 remains of little bother to anyone in the
Unix/Plan9 world, where UTF-16 and its 0x10ffff limit are virtually
unheard of, except for the occasional shaking of heads, and very likely
will remain so. The reduction from 0xffffffff to 0x7fffffff was
technically reasonable though, as it permits the use of signed 32-bit
integer types. UTF-16 remains an ugly miscarriage, because by placing
the surrogates not at the end of the 16-bit space but in the middle of
the code range, it leads to a binary sorting order in B-trees that is
incompatible with that of UCS-4 and UTF-8. It is therefore useless for
database applications that want to hide the internal encoding from the
user of B-tree iterators.]
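The sorting-order remark can be checked with a small Python sketch. The
two code points and the helper names (utf8_key, utf16_key, ucs4_key) are
chosen here purely for illustration and come from no specification; any
BMP character above the surrogate range paired with any supplementary
character should behave the same way.

```python
# Illustrative sketch: compare the binary sort order of two strings
# under UTF-8, UCS-4 (UTF-32BE) and UTF-16BE.
# The code points below are arbitrary examples, not taken from any spec.

a = "\uff5e"      # U+FF5E: in the BMP, above the surrogate range D800..DFFF
b = "\U00010000"  # U+10000: first code point that needs a surrogate pair


def utf8_key(s):
    """Bytes as a UTF-8 keyed B-tree would compare them."""
    return s.encode("utf-8")


def ucs4_key(s):
    """Big-endian UCS-4 bytes; the order equals code point order."""
    return s.encode("utf-32-be")


def utf16_key(s):
    """Big-endian UTF-16 bytes (no BOM); surrogates start at 0xD8."""
    return s.encode("utf-16-be")


by_utf8 = sorted([b, a], key=utf8_key)
by_ucs4 = sorted([b, a], key=ucs4_key)
by_utf16 = sorted([b, a], key=utf16_key)

print("UTF-8 :", [hex(ord(s)) for s in by_utf8])
print("UCS-4 :", [hex(ord(s)) for s in by_ucs4])
print("UTF-16:", [hex(ord(s)) for s in by_utf16])
```

UTF-8 and UCS-4 both place U+FF5E before U+10000, matching code point
order. Under UTF-16 the surrogate pair D800 DC00 compares lower than the
single unit FF5E, so the order is reversed.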
It appears that Miller deserves credit for recognizing that UTF-1 was of
no use whatsoever, Thompson and Pike got everything right from the very
beginning, and all the cruelties done to UTF-8 later did not contribute
in the slightest to its ultimate usefulness and beauty.

Now, who invented UTF-16?

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
