From: [EMAIL PROTECTED] on behalf of Joe
Sent: Thu 6/12/2003 1:16 PM
To: 'Markus Kuhn'; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: Revision of UTF-8 history in draft-yergeau-rfc2279bis-05.txt
> It replaced an earlier attempt to design a FSS/UTF (file system safe UCS
> transformation format) that was circulated in an X/Open working document in
> August 1992 by Gary Miller (IBM), Greger Leijonhufvud and John Entenmann
> (SMI) as a replacement for the division-heavy UTF-1 encoding from the first
> edition of ISO 10646-1. ... FSS/UTF was briefly also referred to as UTF-2
> and later renamed into UTF-8
This corresponds to what I know of the history. I believe Gary Miller was a
major initiator and proponent of the first FSS-UTF design.

Despite Ed's qualifications, I believe it is a fair shorthand to say that Ken
Thompson designed what is now known as UTF-8, and Plan 9 was its initial venue.
Joe
---------------------------
From: [EMAIL PROTECTED]
Date: Tue, 8 Sep 92 03:22:07 EDT
To: [EMAIL PROTECTED]
Subject: (XoJIG 620) <Subject missing>
Here is our modified FSS-UTF proposal. The words are the same as on the
previous proposal. My apologies to the author. The code has been tested to
some degree and should be in pretty good shape. We have converted Plan 9 to
use this encoding and are about to issue a distribution to an initial set of
university users.
File System Safe Universal Character Set Transformation Format (FSS-UTF)
--------------------------------------------------------------------------
With the approval of ISO/IEC 10646 (Unicode) as an international standard and
the anticipated widespread use of this universal coded character set (UCS), it
is necessary for historically ASCII-based operating systems to devise ways to
cope with representation and handling of the large number of characters that
can be encoded by this new standard.

There are several challenges presented by UCS which must be dealt with by
historical operating systems and the C-language programming environment. The
most significant of these challenges is the encoding scheme used by UCS. More
precisely, the challenge is the marrying of the UCS standard with existing
programming languages and existing operating systems and utilities.

The challenges of the programming languages and the UCS standard are being
dealt with by other activities in the industry. However, we are still faced
with the handling of UCS by historical operating systems and utilities.
Prominent among the operating system UCS handling concerns is the
representation of the data within the file system. An underlying assumption is
that there is an absolute requirement to maintain the existing operating system
software investment while at the same time taking advantage of the use of the
large number of characters provided by the UCS.

UCS provides the capability to encode multi-lingual text within a single coded
character set. However, UCS and its UTF variant do not protect null bytes
and/or the ASCII slash ("/"), making these character encodings incompatible
with existing Unix implementations. The following proposal provides a
Unix-compatible transformation format of UCS such that Unix systems can support
multi-lingual text in a single encoding. This transformation format encoding
is intended to be used as a file code. This transformation format encoding of
UCS is intended as an intermediate step towards full UCS support. However,
since nearly all Unix implementations face the same obstacles in supporting
UCS, this proposal is intended to provide a common and compatible encoding
during this transition stage.
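
To make the file-system hazard concrete, here is a minimal sketch in C. The
helper name has_unsafe_byte and the sample value U+2F41 are illustrative
choices and do not come from the proposal. The byte values assume the raw
two-byte big-endian UCS form on one side and the three-byte form from the
table further down in this proposal on the other.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer contains a byte that a
 * historical Unix file system treats specially in a file name
 * (the null byte or the ASCII slash), 0 otherwise. */
static int has_unsafe_byte(const unsigned char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (buf[i] == 0x00 || buf[i] == 0x2F)
            return 1;
    return 0;
}

/* U+2F41 as raw two-byte UCS, big-endian: the first byte equals '/'. */
static const unsigned char ucs2_2f41[] = { 0x2F, 0x41 };

/* The same value in the proposed format: every byte has its high bit set. */
static const unsigned char fss_2f41[] = { 0xE2, 0xBD, 0x81 };
```

Under these assumptions the helper flags the raw UCS bytes and accepts the
transformed bytes, which is the property the proposal asks for.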
Goal/Objective
--------------
With the assumption that most, if not all, of the issues surrounding the
handling and storing of UCS in historical operating system file systems are
understood, the objective is to define a UCS transformation format which also
meets the requirement of being usable on a historical operating system file
system in a non-disruptive manner. The intent is that UCS will be the process
code for the transformation format, which is usable as a file code.
Criteria for the Transformation Format
--------------------------------------
Below are the guidelines that were used in defining the UCS transformation
format:

1) Compatibility with historical file systems:
   Historical file systems disallow the null byte and the ASCII slash
   character as a part of the file name.

2) Compatibility with existing programs:
   The existing model for multibyte processing is that ASCII does not occur
   anywhere in a multibyte encoding. There should be no ASCII code values
   for any part of a transformation format representation of a character
   that was not in the ASCII character set in the UCS representation of the
   character.

3) Ease of conversion from/to UCS.

4) The first byte should indicate the number of bytes to follow in a
   multibyte sequence.

5) The transformation format should not be extravagant in terms of number of
   bytes used for encoding.

6) It should be possible to find the start of a character efficiently
   starting from an arbitrary location in a byte stream.
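
Criterion 6 can be sketched in a few lines of C. The function name
fss_char_start is hypothetical, and the sketch assumes the convention this
proposal adopts, namely that every non-initial byte of a multibyte character
has the form 10xxxxxx, and that the buffer holds well-formed data.

```c
#include <stddef.h>

/* Hypothetical helper: given any offset pos into buf, return the offset
 * of the first byte of the character that contains buf[pos].
 * Bytes of the form 10xxxxxx are never the first byte of a character,
 * so the search steps backwards over them. At most five steps are
 * needed for well-formed input, since the longest sequence is six bytes. */
static size_t fss_char_start(const unsigned char *buf, size_t pos)
{
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

For the byte string 'a', E2, 82, AC, 'b', a call with pos 2 or 3 would return
1, the offset of the lead byte E2.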
Proposed FSS-UTF
----------------
The proposed UCS transformation format encodes UCS values in the range
[0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, 5, and 6
bytes. For all encodings of more than one byte, the initial byte determines
the number of bytes used and the high-order bit in each byte is set. Every
byte that does not start 10xxxxxx is the start of a UCS character sequence.

An easy way to remember this transformation format is to note that the number
of high-order 1's in the first byte signifies the number of bytes in the
multibyte character:

Bytes  Bits  Hex Min   Hex Max   Byte Sequence in Binary
1       7    00000000  0000007F  0vvvvvvv
2      11    00000080  000007FF  110vvvvv 10vvvvvv
3      16    00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
4      21    00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
5      26    00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
6      31    04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

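The table can be read off mechanically by counting the leading 1 bits of the
first byte. The following is only a sketch of that reading. The name
fss_seq_len is hypothetical, and the choice to reject the bytes FE and FF is
an assumption based on their absence from the table.

```c
/* Hypothetical helper: number of bytes in the sequence that begins with
 * lead byte b, or -1 if b cannot begin a sequence. */
static int fss_seq_len(unsigned char b)
{
    int ones = 0;

    /* Count the high-order 1 bits of b. */
    while (ones < 8 && (b & (0x80 >> ones)))
        ones++;

    if (ones == 0)
        return 1;           /* 0vvvvvvv: a single-byte character */
    if (ones == 1 || ones > 6)
        return -1;          /* 10xxxxxx continuation byte, or FE/FF */
    return ones;            /* 2 to 6 leading 1 bits: that many bytes */
}
```

Note that the single-byte case is the one exception to the counting rule: it
has no leading 1 bits but still occupies one byte.
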
The UCS value is just the concatenation of the v bits in the multibyte
encoding. When there are multiple ways to encode a value, for example UCS 0,
only the shortest encoding is legal.
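
As a worked illustration of both sentences, the sketch below concatenates the
v bits of a sequence and, separately, computes the shortest length that can
hold a value. Both function names are hypothetical, and fss_concat_bits does
no validation at all. The bytes E2 82 AC yield 0x20AC, which needs three
bytes, so that encoding is legal. The bytes C0 80 yield 0, which fits in one
byte, so that two-byte form is not.

```c
/* Hypothetical helper: concatenate the v bits of an n-byte sequence
 * (1 <= n <= 6). The caller is assumed to pass a sequence whose
 * length matches its lead byte; nothing is checked here. */
static unsigned long fss_concat_bits(const unsigned char *s, int n)
{
    /* The lead byte of an n-byte sequence (n >= 2) carries 7 - n
     * payload bits; a single byte carries 7. */
    unsigned long v = (n == 1) ? s[0] : (unsigned long)(s[0] & (0x7F >> n));
    int i;

    for (i = 1; i < n; i++)
        v = (v << 6) | (s[i] & 0x3F);   /* six v bits per following byte */
    return v;
}

/* Hypothetical helper: the fewest bytes that can hold value v, taken
 * from the Hex Max column of the table, or -1 if v is out of range. */
static int fss_min_len(unsigned long v)
{
    if (v <= 0x7FUL)       return 1;
    if (v <= 0x7FFUL)      return 2;
    if (v <= 0xFFFFUL)     return 3;
    if (v <= 0x1FFFFFUL)   return 4;
    if (v <= 0x3FFFFFFUL)  return 5;
    if (v <= 0x7FFFFFFFUL) return 6;
    return -1;
}
```

A sequence of n bytes is in shortest form exactly when fss_min_len applied to
its concatenated value returns n.
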
Below are sample implementations of the C standard wctomb() and mbtowc()
functions which demonstrate the algorithms for converting from UCS to the
transformation format and converting from the transformation format to UCS.
The sample implementations include error checks, some of which may not be
necessary for conformance:
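
    #include <stddef.h>    /* wchar_t, size_t */

    typedef struct
    {
        int  cmask;    /* mask selecting the length marker in the first byte */
        int  cval;     /* value of the length marker for this length */
        int  shift;    /* number of value bits carried by the later bytes */
        long lmask;    /* mask for, and largest value of, this length */
        long lval;     /* smallest value that is legal at this length */
    } Tab;

    static Tab tab[] =
    {
        { 0x80, 0x00, 0*6, 0x7F,       0         },  /* 1 byte sequence */
        { 0xE0, 0xC0, 1*6, 0x7FF,      0x80      },  /* 2 byte sequence */
        { 0xF0, 0xE0, 2*6, 0xFFFF,     0x800     },  /* 3 byte sequence */
        { 0xF8, 0xF0, 3*6, 0x1FFFFF,   0x10000   },  /* 4 byte sequence */
        { 0xFC, 0xF8, 4*6, 0x3FFFFFF,  0x200000  },  /* 5 byte sequence */
        { 0xFE, 0xFC, 5*6, 0x7FFFFFFF, 0x4000000 },  /* 6 byte sequence */
        { 0 },                                       /* end of table */
    };

    /* Convert the multibyte character at s (at most n bytes) to UCS. */
    int
    mbtowc(wchar_t *p, char *s, size_t n)
    {
        long l;
        int c0, c, nc;
        Tab *t;

        if(s == 0)
            return 0;

        nc = 0;
        if(n <= nc)
            return -1;
        c0 = *s & 0xff;
        l = c0;
        for(t=tab; t->cmask; t++) {
            nc++;
            /* does the first byte announce a sequence of nc bytes? */
            if((c0 & t->cmask) == t->cval) {
                l &= t->lmask;
                if(l < t->lval)     /* not the shortest encoding */
                    return -1;
                *p = l;
                return nc;
            }
            if(n <= nc)
                return -1;
            s++;
            /* a 10xxxxxx byte becomes 00xxxxxx; anything else is an error */
            c = (*s ^ 0x80) & 0xFF;
            if(c & 0xC0)
                return -1;
            l = (l<<6) | c;
        }
        return -1;
    }

    /* Convert the UCS value wc to a multibyte character stored at s. */
    int
    wctomb(char *s, wchar_t wc)
    {
        long l;
        int c, nc;
        Tab *t;

        if(s == 0)
            return 0;

        l = wc;
        nc = 0;
        for(t=tab; t->cmask; t++) {
            nc++;
            /* the first row that can hold l gives the shortest encoding */
            if(l <= t->lmask) {
                c = t->shift;
                *s = t->cval | (l>>c);
                while(c > 0) {
                    c -= 6;
                    s++;
                    *s = 0x80 | ((l>>c) & 0x3F);
                }
                return nc;
            }
        }
        return -1;
    }
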
---------------------------
