Hi Markus,

This becomes even murkier.  W3C _was_ using NFC, as you say, but:

a)  When the SLP Project (successor to IETF Service Location WG)
    recently asked for advice about which normalization to use
    for SLP string compares, Harald Alvestrand -- author of 
    RFC 2277 "IETF Policy on Character Sets and Languages" and 
    RFC 3066 "Tags for the Identification of Languages" -- told
    us to use NFKC (which folds compatibility equivalents into
    their base characters).  Note that SLP service attributes
    frequently contain URLs, so this amounts to advice to use
    NFKC for comparing URLs.
b)  The latest "Stringprep Profile for Internationalized Host Names"
    <draft-ietf-idn-nameprep-07.txt> (9 January 2002)
    by Paul Hoffman (a Unicode and IETF guru) also uses NFKC.
    Paul is co-author of RFC 2781 "UTF-16, an encoding of ISO 10646".
    Note that IDN WG core specs are now in working group 'last call'.

NFC and NFD are at least reconcilable, without data loss.  NFKC
makes life much harder if it creeps into file systems, because
it discards compatibility distinctions and so loses the ability
to make round-trip transcoding _back_ to the local system's
legacy charset.
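
To make the round-trip point concrete, here is a minimal sketch in
Python using the standard unicodedata module.  The file name and the
choice of Shift_JIS as the "legacy charset" are assumptions picked
only for illustration; they do not come from any of the drafts above.

```python
import unicodedata

# Hypothetical file name: FULLWIDTH LATIN CAPITAL A, KATAKANA GA, ".txt".
# Both non-ASCII characters exist in Shift_JIS (assumed legacy charset).
name = "\uff21\u30ac.txt"
legacy_bytes = name.encode("shift_jis")

# NFD splits GA into KA + combining voiced sound mark; NFC puts it back.
nfd = unicodedata.normalize("NFD", name)
back_from_nfd = unicodedata.normalize("NFC", nfd)

# NFKC additionally folds the fullwidth A into plain ASCII "A".
nfkc = unicodedata.normalize("NFKC", name)

# NFC <-> NFD is reversible, so the legacy bytes survive the trip.
nfd_round_trip_ok = back_from_nfd.encode("shift_jis") == legacy_bytes

# After NFKC the fullwidth/ASCII distinction is gone for good.
nfkc_round_trip_ok = nfkc.encode("shift_jis") == legacy_bytes
```

Under these assumptions nfd_round_trip_ok is true and
nfkc_round_trip_ok is false: nothing in the NFKC string records that
the first letter used to be the fullwidth form.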

By the way, Harald Alvestrand is now the _Chair_ of the IESG, so
his recommendations carry considerable weight in IETF standards.

Cheers,
- Ira McDonald
  High North Inc


-----Original Message-----
From: Markus Kuhn [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 23, 2002 1:38 PM
To: [EMAIL PROTECTED]
Subject: Re: NFS4 requires UTF-8 


"Kent Karlsson" wrote on 2002-02-23 13:33 UTC:
> Also of interest here may be that, IIRC, HFS+ and UFS (the Apple
> file systems) represent all file names in NFD (and for UFS: in UTF-8).
> NFD, not NFC.

Oops, I didn't know that. That's far more of a concern when files are
exchanged between Macs and Linux. In particular since MacOS is in its
latest incarnation just running on top of Berkeley Unix, I expect the
Mac platform to be far more frequently integrated with Unix systems, via
NFS, tar, pkzip, etc.

Alternative solutions:

 a) Linux goes NFD.
 b) MacOS goes NFC.
 c) Normalization when transferring files between the two worlds.
 d) Both sides learn to work well with either form.

The reasons for Linux preferring NFC were

  - That's far closer to existing practice with ISO 8859, JIS, etc.
  - The W3C has said that NFC shall be what the Web uses

and are as far as I can see still valid. The Linux world will in the
long run have to learn how to use combining characters anyway, as some
scripts depend on them (Thai most notably), so the occasional NFD file
from a Mac shouldn't cause major disruption. GUI file selection will run
as before, independent of coding variants, and for the shell I can see
numerous tiny improvements to globbing and the TAB filename expansion
mechanism to make handling the NFC/NFD difference far more convenient.

It would be nice if the MacOS world and the Linux world used the same
convention, but if not, I think how easy it will be to deal with the
difference is a matter of user interface maturity.

Example:

You have two files

  Müller
  Müllerin

in a directory, the first in NFD, the second in NFC. If you press M+TAB
in a yet to be written UTF-8 aware version of bash, it will fail to
expand to Müller, as the two strings differ after the first letter.
Typing Mu+TAB will expand one, and typing Mü+TAB will expand the other,
so there is a solution for experienced users.  A user interface
improvement would be to provide two control keys that allow the user to scroll
through the list of files that are available in the current state of the
TAB selection. I could also imagine bash doing a normalization, such
that entering a prefix in one normalization will include the file name
in the other one as well. There are lots of ways to implement this in a
convenient way, and the only real problem is to get the bash maintainers
interested in UTF-8 at all ...
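
The normalizing variant could be sketched roughly as follows. This is
Python rather than bash internals, and complete() is a hypothetical
helper made up for illustration, not an existing shell or readline
function. It assumes both the typed prefix and the directory entries
are normalized to one form before the prefix comparison.

```python
import unicodedata

def complete(prefix, filenames, form="NFC"):
    """Return the file names that match the prefix once both sides
    have been brought into the same normalization form."""
    key = unicodedata.normalize(form, prefix)
    return [name for name in filenames
            if unicodedata.normalize(form, name).startswith(key)]

# The two files from the example: the first stored in NFD, the second in NFC.
nfd_name = unicodedata.normalize("NFD", "M\u00fcller")
nfc_name = unicodedata.normalize("NFC", "M\u00fcllerin")
files = [nfd_name, nfc_name]

# The prefix finds both files regardless of how the user typed it.
typed_nfc = "M\u00fc"      # precomposed u-umlaut
typed_nfd = "Mu\u0308"     # u followed by combining diaeresis
matches_nfc = complete(typed_nfc, files)
matches_nfd = complete(typed_nfd, files)
```

The names are returned exactly as stored, so the file system is never
rewritten; only the comparison is normalized. Comparing in NFC means a
bare "Mu" no longer matches either file, whereas passing form="NFD"
would let "Mu" match both. Which of the two is less surprising is a
user interface decision.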

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
