Frank da Cruz wrote on 2001-06-06 20:21 UTC:
> > one day, when UTF-8 has become the almost exclusively used encoding...
> >
> I haven't been following this stuff at all, but I assume that when this
> happens, we'll also have a UTF-8 based file system, with file and directory
> names in UTF-8, and all the rest.
Of course. That's the only way! UTF-8 will replace ASCII *AT ALL
LEVELS*! Directory names, environment variable content, C source code,
HTML pages. Everything where you use ASCII today. Absolutely everything!
The fathers of Unix have been doing exactly that since back in 1992 in
Plan 9, the Unix successor operating system, and it is a shame how long
it is taking Linux and BSD to follow here.
Classic reading:
Rob Pike and Ken Thompson, USENIX 1992:
ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz
If you have a Greek or Cyrillic keyboard, then Greek or Cyrillic file
names are a perfectly natural thing to use. If users of Greek and
Cyrillic keyboards share directories (as they do on public ftp
servers), then agreeing on a common encoding is the only way to avoid
"mojibake" (apparently Japanese slang for "garbled text due to
mismatched character encodings").
> Yikes! Imagine the possibilities for
> mistaking one file for another, not finding a desired file, deleting the
> wrong file, etc, unless absolutely *everybody* agrees on *exactly* how at
> least the following things are handled:
>
> . Case mapping on case-insensitive file systems
> . Canonical composition or decomposition
> . Canonical ordering of combining characters
The issues you mention are of little relevance. (a) Unix does not have
a case-insensitive file system. (b) For the file system, we focus at
the moment only on ISO 10646 Level 1, that is, no combining characters.
Details:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux
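For those wondering what (b) buys us, a small sketch (the example
character is my own): the precomposed LATIN SMALL LETTER E WITH ACUTE
(U+00E9) and the decomposed pair U+0065 U+0301 render identically, yet
they are different byte strings, so a file system comparing opaque
bytes would treat them as two distinct names. Restricting ourselves to
Level 1 means only the precomposed spelling ever occurs.

    #include <stdio.h>
    #include <string.h>

    /* Precomposed vs. decomposed "e with acute": identical on
     * screen, different as opaque byte strings. */
    int main(void)
    {
        const char nfc[] = "\xC3\xA9";   /* U+00E9, precomposed        */
        const char nfd[] = "e\xCC\x81";  /* U+0065 + U+0301, combining */
        printf("%s\n", strcmp(nfc, nfd) == 0
                       ? "same byte string" : "different byte strings");
        return 0;
    }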
> Not to mention issues of sorting and collation, e.g. for listing files
> in "alphabetical" order.
You already have different sorting orders today in ISO 8859-1, -9, and
-15 locales. UTF-8 doesn't really add much new complexity to this
long-standing practice.
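In C, that long-standing practice is the split between strcmp(), which
compares raw bytes, and strcoll(), which applies the collation rules of
the current LC_COLLATE locale. A sketch (the locale name is an
assumption; substitute whatever UTF-8 locale your system has
installed):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Assumes en_US.UTF-8 is installed; substitute your own. */
        if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
            fputs("locale not installed\n", stderr);
        /* strcmp() orders by byte value; strcoll() by locale rules.
         * The two can disagree, e.g. on case or accented letters. */
        printf("strcmp:  %d\n", strcmp("Apple", "apple"));
        printf("strcoll: %d\n", strcoll("Apple", "apple"));
        return 0;
    }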
> Nor to mention the many and varied "versions" of UTF-8,
???
There is only one version of UTF-8. What are you talking about?
> or the eternally shifting landscape of Unicode/ISO10646 itself.
Unicode / ISO 10646 is without doubt by far the best-documented large
character set on this planet. Just look at the mess in various CJK sets
(yen/backslash confusion, lots of bugs in various GB18030 versions,
etc.)
> Even if Linux gets it right, then we have cross-platform issues such as
> NFS mounts, FTP, and so on.
No special kernel or file-system semantics are required for handling
UTF-8 whatsoever. File names are just opaque byte strings in which '/'
and '\0' are the only bytes with a special meaning. That's ALL. Neither
the kernel nor ftp gives a damn about which character encoding a file
name is in. That is why it is such a good idea that everyone should use
the same encoding, and UTF-8 is really the only globally acceptable
candidate for that.
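A minimal sketch of that point (the Greek file name is my own
illustrative choice): creating a file whose name is UTF-8 requires
nothing beyond passing the bytes through.

    #include <stdio.h>

    /* The name is the Greek letters alpha beta (U+03B1 U+03B2),
     * handed to the kernel as plain bytes.  Only '/' and '\0'
     * would be special; the kernel never decodes the rest. */
    int main(void)
    {
        FILE *f = fopen("\xCE\xB1\xCE\xB2.txt", "w");
        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        fputs("hello\n", f);
        fclose(f);
        return 0;   /* ls on a UTF-8 terminal shows the Greek name */
    }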
Of course, the other Unices will also need decent UTF-8 locale support,
UTF-8 terminal emulators, etc. Linux has had it since glibc 2.2;
Solaris and AIX had it even before that. Only the BSD community doesn't
seem to have UTF-8 support yet, but I hope that will be fixed soon as
well.
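For application authors, "decent UTF-8 locale support" mostly means the
usual POSIX dance works: opt into the user's locale and query the
encoding it implies. A sketch, assuming an environment such as
LANG=en_US.UTF-8:

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* Adopt the locale from the environment (LANG, LC_*). */
        setlocale(LC_ALL, "");
        /* Under glibc 2.2 with a UTF-8 locale this prints "UTF-8". */
        printf("codeset: %s\n", nl_langinfo(CODESET));
        return 0;
    }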
> I assume some group somewhere is working on all this..
Assume nothing. It's us here on linux-utf8 and nobody else. There is
nobody else coming later to define "the official proper solution",
because nobody else has more practical experience with all this than the
linux-utf8 crowd that has been playing around with ubiquitous UTF-8
under Linux for ~2 years now.
Recommended reading: "Where Wizards Stay Up Late" is an excellent
biography of the Internet going back to the late 1960s ARPANet days.
There you will read how all the three-digit RFCs were literally written
as very informal notes and minutes of discussions between a few
graduate students. They all thought that "some group somewhere" was
working on the proper standards for the ARPANet and would come in any
minute to tell them what the real architecture of the Internet would
be. Nobody ever came, and we are still running on these rather
informally written "requests for comments".
The same will happen with UTF-8 under Unix. The experimental design
decisions that we make here on things such as how to extend xterm and
the X11 selection and font mechanisms for UTF-8 usage are likely to
influence common practice for many years to come.
> The mind boggles.
Don't be so pessimistic. Most of the problems that you mentioned are
really non-issues. There are indeed a few slightly trickier ones, but
you probably haven't spotted those yet. Regular expressions and wcwidth
are examples, and we have reasonably good first-generation solutions
established for them now. The world will not be quite as simple as it
was with ASCII, for sure, but it will be considerably simpler than with
the many different encodings that we have to cope with now.
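To illustrate why wcwidth is on that tricky list, a sketch (the sample
string is my own; it assumes a UTF-8 locale is installed): the display
width of a string is no longer its byte count, since CJK characters
occupy two terminal columns and combining characters occupy none.

    #define _XOPEN_SOURCE 700   /* for wcwidth() */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");           /* assumes a UTF-8 locale */
        const char *s = "a\xE6\xBC\xA2"; /* 'a' + U+6F22, a CJK ideograph */
        wchar_t wcs[16];
        size_t n = mbstowcs(wcs, s, 16);
        if (n == (size_t) -1)
            return 1;                    /* invalid multibyte sequence */
        int width = 0;
        for (size_t i = 0; i < n; i++)
            width += wcwidth(wcs[i]);    /* 1 column + 2 columns */
        printf("bytes: %zu  columns: %d\n", strlen(s), width);
        return 0;   /* prints "bytes: 4  columns: 3" */
    }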
> Obviously we have the same issues in network domain names
There are already draft IETF standards for Unicode DNS names.
> email addresses
RFC 2822 has not yet tackled these issues.
I'm sceptical myself about how good an idea it is to use Unicode in
email addresses and DNS, because case mapping does indeed become an
issue there. Things are much simpler for the file system.
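A sketch of the case-mapping trap (assuming the en_US.UTF-8 and
tr_TR.UTF-8 locales are installed): lowercasing is locale-dependent, so
two hosts can disagree about whether two "case-insensitive" names
match.

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int main(void)
    {
        /* In English, 'I' lowercases to 'i' (U+0069) ... */
        setlocale(LC_CTYPE, "en_US.UTF-8");
        printf("en: U+%04X\n", (unsigned) towlower(L'I'));
        /* ... but in Turkish it lowercases to dotless i (U+0131). */
        setlocale(LC_CTYPE, "tr_TR.UTF-8");
        printf("tr: U+%04X\n", (unsigned) towlower(L'I'));
        return 0;
    }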
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/