Markus Kuhn wrote:

> The fathers of Unix have been doing that since back in 1992 in the
> Plan 9 Unix successor operating system, and it is a shame how long it
> takes Linux and BSD to follow here.
> 
But for obvious reasons: the UNIXes branched away from each other a long
time ago.  Plan 9 might have been a good replacement, but it didn't catch
on.  Even so, I'm not so sure that the issues I listed are considered in
Plan 9, which also limits itself to Level 1.

Meanwhile, adding UTF-8 support to the other UNIXes will be done (or not)
independently, slowly, and inconsistently: not just the BSDs, but also
Solaris, HP-UX, AIX, IRIX, Unixware, Mac OS X, Tru64, QNX, SINIX, and all
the rest.

> The issues you mention are of little relevance. (a) Unix does not have a
> case-insensitive file system. (b) At the moment, we focus for the file
> system only on ISO 10646 Level 1, that is, no combining characters.
>
That makes it easy, of course.  But as soon as this is implemented,
demands will follow for full Unicode / ISO 10646, and then come all
the complications of canonical ordering, combining characters, nonzero
planes, surrogates, database lookups to find out the properties of each
character, composition, decomposition, sorting of sequences of combining
characters, different versions of the database, and so on.
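
To make "canonical ordering" concrete: the same letter can carry its
combining marks in more than one order, and normalization has to sort
them by combining class before two names can even be compared.  A quick
sketch in Python using the standard unicodedata module (illustrative
only; nobody is proposing to put this in the kernel):

    import unicodedata

    # U+0323 COMBINING DOT BELOW (class 220) and U+0307 COMBINING DOT
    # ABOVE (class 230) attached to "a" in two different orders.
    s1 = "a\u0323\u0307"
    s2 = "a\u0307\u0323"
    print(s1 == s2)        # False: the code points come in different order

    # Canonical ordering, applied during normalization, sorts the marks
    # by combining class, so both collapse to the same sequence.
    print(unicodedata.normalize("NFD", s1) ==
          unicodedata.normalize("NFD", s2))        # True

    # And the combining classes themselves require a database lookup:
    print(unicodedata.combining("\u0323"),
          unicodedata.combining("\u0307"))         # 220 230

Every step above depends on the character database, which brings in the
versioning problem just mentioned.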

> > Nor to mention the many and varied "versions" of UTF-8,
> 
> There is only one version of UTF-8. What are you talking about?
> 
The UTF-8 which allows non-shortest sequences to be read versus the one that
does not.  The UTF-8 which emits non-shortest sequences versus one that does
not.  The UTF-8 which "decodes" surrogates versus the one that treats them as
if they were regular UCS-2 characters.  The UTF-8 that is limited to 6-byte
sequences versus the unlimited one.  There's been a lot of talk on the Unico[dr]e
lists about this recently, and new proposals for modifications to UTF-8
surface all the time (the current hot topic is UTF-8S, proposed by Oracle).
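
To pin down the first two differences: a lenient reader accepts
non-shortest ("overlong") sequences, so the same character can arrive
from several distinct byte strings, which is exactly why a byte-level
check for '/' can be bypassed.  Here is a Python sketch (Python's own
codec happens to be the strict kind, so the lenient arithmetic is
spelled out by hand):

    # '/' is U+002F; the shortest form is the single byte 2F.  The
    # overlong two-byte form C0 AF encodes the very same code point.
    overlong_slash = bytes([0xC0, 0xAF])

    try:
        overlong_slash.decode("utf-8")             # strict: rejected
    except UnicodeDecodeError as e:
        print("strict decoder rejects C0 AF:", e.reason)

    # A lenient decoder effectively does this instead:
    b0, b1 = overlong_slash
    print(hex(((b0 & 0x1F) << 6) | (b1 & 0x3F)))   # 0x2f, i.e. '/'

    # The "decodes surrogates" variant (the behavior UTF-8S would make
    # official) likewise accepts ED A0 80 ED B0 80 -- two 3-byte
    # surrogate encodings -- as the single character U+10000.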

> No special kernel or file system semantics is required for handling
> UTF-8 whatsoever. File names are just opaque byte strings in which '/'
> and '\0' are the only bytes with a special meaning. That's ALL.
>
As long as you restrict the implementation to Level 1.  Even then you can
get into trouble with filenames (or URLs, etc.) that "look like" another
filename.  For example, by substituting Cyrillic A for Latin A in
Amazon.com, I can trick people into coming to my website instead of the
real one, or into accessing the wrong file.  That's only one example.
More subtle ones
occur with combining characters.  The user has no way of telling what the
underlying byte sequence is by looking at the graphical representation.
(The fact that Linux doesn't care about combining characters doesn't mean
terminal emulators and Web browsers and FTP clients that access the Linux
file system won't care.)
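
Here is the Cyrillic trick spelled out (Python again, for illustration):

    latin    = "Amazon.com"         # leading U+0041 LATIN CAPITAL LETTER A
    cyrillic = "\u0410mazon.com"    # leading U+0410 CYRILLIC CAPITAL LETTER A

    print(latin == cyrillic)        # False, though most fonts draw them alike
    print(latin.encode("utf-8"))    # b'Amazon.com'
    print(cyrillic.encode("utf-8")) # b'\xd0\x90mazon.com' -- a different name

Two names, one appearance, and the kernel's opaque-byte-string view
offers the user no way to tell them apart.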

Even assuming everybody is trustworthy and intends no tricks, it is still
very possible that two different UTF-8 aware clients on different platforms
will "spell" the same name two different ways: one client enforces
Normalization Form C, and the other enforces Normalization Form D, yielding
two different spellings for every German word that contains an umlaut.  You
type it the same way on your German keyboard and it looks the same on your
screen, but the underlying representation is different, resulting in two
different names reaching Linux from these two clients.
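
For example (a Python sketch; the name "Müll" is just a stand-in):

    import unicodedata

    # Form C: precomposed U+00FC.  Form D: 'u' + U+0308 COMBINING DIAERESIS.
    nfc = unicodedata.normalize("NFC", "M\u00fcll")
    nfd = unicodedata.normalize("NFD", "M\u00fcll")

    print(nfc == nfd)              # False
    print(nfc.encode("utf-8"))     # b'M\xc3\xbcll'
    print(nfd.encode("utf-8"))     # b'Mu\xcc\x88ll' -- same glyphs, other bytes

On a file system that treats names as opaque bytes, creating "Müll" from
both clients yields two distinct directory entries.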

> > Obviously we have the same issues in network domain names
> 
> There are already draft IETF standards for Unicode DNS names.
> ...
> I'm sceptical myself about how good an idea it is to use Unicode in
> email addresses and DNS, because case mapping becomes an issue here
> indeed. Things are much simpler for the file system.
> 
I predict that the hard issues will have to be tackled eventually, and I
believe it's important that whatever rules and procedures are adopted for
associating UTF-8 character strings with domain names, e-mail addresses, file
and directory names, and all the rest, should be uniform.  Otherwise we'll
have unbelievable interoperability and security problems.  DNS is forced to
grapple with all these problems right away, simply because of case
independence and because of the obvious security risks (such as website
masquerading by sending people "disguised" URLs to click on).

Personally, I would be happy to stick with UCS-2 and ISO 10646-1 Level 1,
as I have done so far in the Kermit code.  The rest is going to take us
into a whole new and strange realm.

Not that it shouldn't be done!  But after we enter this world, programming
will never be the same...

- Frank

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
