RE: Revision of UTF-8 history in draft-yergeau-rfc2279bis-05.txt

Winkler, Arnold F Fri, 13 Jun 2003 09:40:38 -0700

Hi Markus,

I have a paper copy of SC22/WG20 N193 from May 1993. It was submitted to WG20 for consideration by Gary Miller (IBM), who was a member of WG20 at that time.

X/Open Preliminary Specification
File System Safe UCS Transformation Format (FSS-UTF)

ISBN: 1-872630-96-0
X/Open Document Number: P316

with a copyright notice from X/Open, May 1993.

There are 22 (partly empty) pages in this document, and it is possibly a copy of a copy, but it is perfectly readable.

A short excerpt:
2.2 Specification
The FSS-UTF encodes UCS values in the range [0,0x7FFFFFFF] using multi-byte characters of lengths 1, 2, 3, 4, 5 and 6 bytes.
For all encodings ....

I could scan the document and make a PDF file out of it (still rather large, I guess), or what would you prefer?

Please advise

Arnold

[EMAIL PROTECTED]

-----Original Message-----
From: Markus Kuhn [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 13, 2003 10:19 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Revision of UTF-8 history in
draft-yergeau-rfc2279bis-05.txt

John Cowan wrote on 2003-06-12 21:15 UTC:
> Francois Yergeau scripsit:
>
> > The other is the modification that, according to Markus' references, Ken
> > Thompson designed on a diner placemat and sent in an email dated Fri Sep 4
> > 03:37:39 EDT 1992.
>
> Is this email online anywhere? Its historical value would be very great.

My little UTF-8 history on

http://www.cl.cam.ac.uk/~mgk25/unicode.html#history

has a link to

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

which contains excerpts of my recent correspondence with Rob Pike on
this issue. It includes the emails and original specs as retrieved a few
days ago from the Plan9 project archives.

If anyone can find a copy of the original FSS/UTF (Miller et al.)
proposal and early X/Open documents on the subject, I would be most
interested in having a look at them and adding them to the above
collection.

Note that the archives contain two snapshots of the original FSS/UTF (Thompson)
specification. The first one encoded 32-bit words, whereas
the later one dropped the use of the bytes 0xfe and 0xff and thereby
encoded only 31-bit words. Thompson's original spec even had a
note on overlong sequences being illegal!
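
For readers who want to see the 31-bit scheme concretely, here is a minimal
Python sketch. It assumes the 1-to-6-byte layout as later published in
RFC 2279; it is not transcribed from the Plan9 archive files or from the
X/Open document, and the names encode31, decode31, LIMITS and LEAD are
hypothetical, chosen only for this illustration.

```python
# Largest value representable by a sequence of 1..6 bytes (RFC 2279 layout).
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# Lead-byte bit pattern for a sequence of 1..6 bytes.
LEAD = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode31(value):
    """Encode one value in [0, 0x7FFFFFFF] using the shortest sequence."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value out of 31-bit range")
    length = next(i + 1 for i, lim in enumerate(LIMITS) if value <= lim)
    if length == 1:
        return bytes([value])
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))  # continuation byte 10xxxxxx
        value >>= 6
    return bytes([LEAD[length - 1] | value] + tail[::-1])


def decode31(data):
    """Decode exactly one sequence; reject malformed and overlong input."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        length, value = 1, first
    else:
        for length in range(2, 7):
            # A lead byte for n bytes starts with n one-bits and a zero-bit.
            mask = (0xFF << (7 - length)) & 0xFF
            if first & mask == LEAD[length - 1]:
                value = first & ~mask & 0xFF
                break
        else:
            # Covers stray continuation bytes as well as 0xFE and 0xFF.
            raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong sequence length")
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    if length > 1 and value <= LIMITS[length - 2]:
        raise ValueError("overlong sequence")
    return value
```

In this sketch 0xfe and 0xff can never appear as lead bytes, which is what
restricts the range to 31 bits, and a sequence such as C0 80 is rejected as
overlong because the value 0 already fits in a single byte.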

We can see here the evolution of a very simple and elegant original
design, which was then continuously ground down by the work of committees to
the odd interconnections with UTF-16 that we see today.

[Fortunately though, UTF-16 remains of little bother to anyone in the
Unix/Plan9 world, where UTF-16 and its 0x10ffff limit are virtually
unheard of, except for the occasional shaking of heads, and very likely
will remain so. The reduction from 0xffffffff to 0x7fffffff was
technically reasonable though, as it permits the use of signed 32-bit
integer types. UTF-16 remains an ugly miscarriage, because by placing
the surrogates not at the end of the 16-bit space but into the middle of
the code range, it leads to an incompatible binary sorting order in
B-trees with UCS-4 and UTF-8 and therefore is useless for database
applications that want to hide the internal encoding from the user of
B-tree iterators.]
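
The sorting point can be checked with a small Python sketch. It relies only
on the standard codecs "utf-8", "utf-16-be" and "utf-32-be"; the sample code
points and variable names are arbitrary choices for this illustration, and
comparing big-endian bytes is used here as a stand-in for comparing 16-bit
code units.

```python
# Code points on both sides of the surrogate block (0xD800..0xDFFF).
points = [0x0041, 0xE000, 0xFFFD, 0x10000, 0x10FFFF]
chars = [chr(p) for p in points]

# Sort the same characters by the raw bytes of each encoding.
by_utf8 = sorted(chars, key=lambda c: c.encode("utf-8"))
by_ucs4 = sorted(chars, key=lambda c: c.encode("utf-32-be"))
by_utf16 = sorted(chars, key=lambda c: c.encode("utf-16-be"))

# Binary order of UTF-8 and of UCS-4 (big-endian) follows code point order.
print([hex(ord(c)) for c in by_utf8])
print([hex(ord(c)) for c in by_ucs4])
# In UTF-16, characters above 0xFFFF start with a unit in 0xD800..0xDBFF,
# so they sort before everything in 0xE000..0xFFFF.
print([hex(ord(c)) for c in by_utf16])
```

With these samples, the UTF-16 ordering places 0x10000 and 0x10ffff ahead of
0xe000 and 0xfffd, while the other two orderings agree with each other.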

It appears that Miller deserves credit for recognizing that UTF-1 was of
no use whatsoever, Thompson and Pike got everything right from the very
beginning, and all the cruelties done to UTF-8 later didn't contribute
the slightest to its ultimate usefulness and beauty.

Now, who invented UTF-16?

Markus

--
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__
