Hi Markus,
I have a paper copy of SC22/WG20 N193 from May 1993. It was submitted to
WG20 for consideration by Gary Miller (IBM), who was a member of WG20 at
that time.
X/Open Preliminary Specification
File System Safe UCS Transformation Format (FSS-UTF)
ISBN: 1-872630-96-0
X/Open Document Number: P316
with a copyright notice from X/Open, May 1993.
There are 22 (partly empty) pages in this document, and it is possibly a
copy of a copy, but it is perfectly readable.
A short excerpt:
2.2 Specification
The FSS-UTF encodes UCS values in the range [0,0x7FFFFFFF] using
multi-byte characters of lengths 1, 2, 3, 4, 5 and 6 bytes.
For all encodings ....
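To make the six sequence lengths in the excerpt concrete, here is a minimal Python sketch. The range boundaries are not part of the excerpt above; they are assumed from the layout later published in RFC 2279, and the name `fss_utf_length` is hypothetical.

```python
# Largest UCS value that fits in a sequence of 1, 2, ..., 6 bytes.
# Assumption: boundaries follow the RFC 2279 layout (7, 11, 16, 21, 26
# and 31 payload bits); the X/Open excerpt itself does not list them.
_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def fss_utf_length(ucs: int) -> int:
    """Return the number of bytes needed for one UCS value (hypothetical helper)."""
    if not 0 <= ucs <= 0x7FFFFFFF:
        raise ValueError("UCS value outside [0, 0x7FFFFFFF]")
    for nbytes, limit in enumerate(_LIMITS, start=1):
        if ucs <= limit:
            return nbytes
    raise AssertionError("unreachable: range was checked above")


if __name__ == "__main__":
    for value in (0x41, 0x20AC, 0x10000, 0x7FFFFFFF):
        print(hex(value), fss_utf_length(value))
```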
I could scan the document and make a PDF file out of it (still rather
large, I guess), or what would you prefer?
Please advise
Arnold
-----Original Message-----
From: Markus Kuhn [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 13, 2003 10:19 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Revision of UTF-8 history in draft-yergeau-rfc2279bis-05.txt
John Cowan wrote on 2003-06-12 21:15 UTC:
> Francois Yergeau scripsit:
>
> > The other is the modification that, according to Markus' references,
> > Ken Thompson designed on a diner placemat and sent in an email dated
> > Fri Sep 4 03:37:39 EDT 1992.
>
> Is this email online anywhere? Its historical value would be very great.
My little UTF-8 history on
http://www.cl.cam.ac.uk/~mgk25/unicode.html#history
has a link to
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
which contains excerpts of my recent correspondence with Rob Pike on
this issue. It includes the emails and original specs as retrieved a few
days ago from the Plan9 project archives.
If anyone can find a copy of the original FSS/UTF (Miller et al.)
proposal and early X/Open documents on the subject, I would be most
interested in having a look at them and adding them to the above
collection.
Note that the archives contain two snapshots of the original FSS/UTF
(Thompson) specification. The first one encoded 32-bit words, whereas
the later one dropped the use of the bytes 0xfe and 0xff and thereby
encoded only 31-bit words. Thompson's original spec even had a note on
overlong sequences being illegal!
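Both points can be illustrated with a short Python sketch of the 31-bit form. This is not Thompson's code or the archived spec: the bit layout is assumed from the later RFC 2279 description, and the names `encode31` and `decode31` are hypothetical. The encoder never emits 0xfe or 0xff (the largest lead byte is 0xfd), and the decoder rejects any sequence longer than the value requires.

```python
# Assumed layout (RFC 2279): sequences of 1..6 bytes carry
# 7, 11, 16, 21, 26 and 31 payload bits respectively.
_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def encode31(ucs: int) -> bytes:
    """Encode one UCS value in [0, 0x7FFFFFFF] (hypothetical helper)."""
    if not 0 <= ucs <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    if ucs < 0x80:
        return bytes([ucs])
    n = next(i for i, lim in enumerate(_LIMITS, start=1) if ucs <= lim)
    tail = []
    for _ in range(n - 1):                 # low 6 bits go into each trailing byte
        tail.append(0x80 | (ucs & 0x3F))
        ucs >>= 6
    lead = ((0xFF << (8 - n)) & 0xFF) | ucs  # n leading 1 bits, then a 0 bit
    return bytes([lead] + tail[::-1])


def decode31(seq: bytes) -> int:
    """Decode exactly one sequence; reject overlong forms and 0xfe/0xff."""
    first = seq[0]
    if first < 0x80:
        n, value = 1, first
    else:
        n, mask = 0, 0x80
        while first & mask:                # count the leading 1 bits
            n += 1
            mask >>= 1
        if n == 1 or n > 6:                # stray trailing byte, or 0xfe/0xff
            raise ValueError("invalid lead byte")
        value = first & (mask - 1)
    if len(seq) != n:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid trailing byte")
        value = (value << 6) | (b & 0x3F)
    if n > 1 and value <= _LIMITS[n - 2]:  # would have fit in fewer bytes
        raise ValueError("overlong sequence")
    return value


if __name__ == "__main__":
    print(encode31(0x20AC).hex(), encode31(0x7FFFFFFF).hex())
```

For example, the two-byte sequence 0xc0 0x80 would decode to 0, which already fits in one byte, so `decode31` raises `ValueError` for it.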
We can see here the evolution of a very simple and elegant original
design, which was then continuously ground down by the work of
committees to the odd interconnections with UTF-16 that we see today.
[Fortunately though, UTF-16 remains of little bother to anyone in the
Unix/Plan9 world, where UTF-16 and its 0x10ffff limit are virtually
unheard of, except for the occasional shaking of heads, and very likely
will remain so. The reduction from 0xffffffff to 0x7fffffff was
technically reasonable though, as it permits the use of signed 32-bit
integer types. UTF-16 remains an ugly miscarriage, because by placing
the surrogates not at the end of the 16-bit space but into the middle of
the code range, it leads to an incompatible binary sorting order in
B-trees with UCS-4 and UTF-8 and therefore is useless for database
applications that want to hide the internal encoding from the user of
B-tree iterators.]
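The sorting remark can be checked with a few lines of Python. This is only a sketch: the sample code points are chosen for illustration, and big-endian byte order is assumed for UTF-16 and UCS-4 so that comparing raw bytes is equivalent to comparing 16-bit or 32-bit units.

```python
# Sample code points (illustrative choice): one ASCII letter, two BMP
# characters above the surrogate block, and two supplementary characters.
points = [0x41, 0xE000, 0xFFFD, 0x10000, 0x10FFFF]


def sort_by_encoding(code_points, codec):
    """Sort code points by the raw bytes of their encoding in `codec`."""
    return sorted(code_points, key=lambda cp: chr(cp).encode(codec))


by_utf8 = sort_by_encoding(points, "utf-8")
by_ucs4 = sort_by_encoding(points, "utf-32-be")
by_utf16 = sort_by_encoding(points, "utf-16-be")

if __name__ == "__main__":
    print("UTF-8 :", [hex(cp) for cp in by_utf8])
    print("UCS-4 :", [hex(cp) for cp in by_ucs4])
    print("UTF-16:", [hex(cp) for cp in by_utf16])
```

UTF-8 and UCS-4 both sort in code point order, while in UTF-16 the supplementary characters start with a surrogate unit in 0xD800..0xDBFF and therefore sort before U+E000 and U+FFFD.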
It appears that Miller deserves credit for recognizing that UTF-1 was of
no use whatsoever, Thompson and Pike got everything right from the very
beginning, and all the cruelties done to UTF-8 later didn't contribute
in the slightest to its ultimate usefulness and beauty.
Now, who invented UTF-16?
Markus
--
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__
