Bruno Haible wrote on 2002-02-20 12:01 UTC:
> Where is the conversion between the NFS filenames and the user visible
> filenames (in locale encoding) to take place? Probably in the kernel,
> and the user-visible encoding will be given by a mount option?

I think the way they phrased it, they were primarily assuming that
Unix file names should always be in UTF-8, hence the word "transparent".

   The general file system model used for the NFS version 4 protocol is
   the same as previous versions.  The server file system is
   hierarchical with the regular files contained within being treated as
   opaque byte streams.  In a slight departure, file and directory names
   are encoded with UTF-8 to deal with the basics of
   internationalization.

Conversion can be lossy or awkward, and is thus an endless can of worms.
Let's not start that, please.
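
To make "lossy" concrete, here is a minimal Python sketch. It assumes a hypothetical conversion layer that maps UTF-8 names into an ISO 8859-1 locale and substitutes "?" for anything unrepresentable; the file names are made up for illustration.

```python
# Assumption: a conversion layer that replaces unrepresentable
# characters with "?" when mapping names into an ISO 8859-1 locale.
def to_latin1_locale(name: str) -> bytes:
    return name.encode("iso-8859-1", errors="replace")

# Two distinct (made-up) file names stored in UTF-8 on the server ...
name_a = "日本.txt"
name_b = "中国.txt"

# ... become indistinguishable after conversion to the local encoding.
local_a = to_latin1_locale(name_a)
local_b = to_latin1_locale(name_b)

print(local_a, local_b)     # both are b'??.txt'
print(local_a == local_b)   # True: the client can no longer tell them apart
```

Once two names collide like this, there is no way to map the local name back to the right file.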

11.  Internationalization

   The primary issue in which NFS needs to deal with
   internationalization, or I18n, is with respect to file names and
   other strings as used within the protocol.  The choice of string
   representation must allow reasonable name/string access to clients
   which use various languages.  The UTF-8 encoding of the UCS as
   defined by [ISO10646] allows for this type of access and follows the
   policy described in "IETF Policy on Character Sets and Languages",
   [RFC2277].  This choice is explained further in the following.

11.1.  Universal Versus Local Character Sets

   [RFC1345] describes a table of 16 bit characters for many different
   languages (the bit encodings match Unicode, though of course RFC1345
   is somewhat out of date with respect to current Unicode assignments).
   Each character from each language has a unique 16 bit value in the 16
   bit character set.  Thus this table can be thought of as a universal
   character set.  [RFC1345] then talks about groupings of subsets of
   the entire 16 bit character set into "Charset Tables".  For example
   one might take all the Greek characters from the 16 bit table (which
   are consecutively allocated), and normalize their offsets to a table
   that fits in 7 bits.  Thus it is determined that "lower case alpha"
   is in the same position as "upper case a" in the US-ASCII table, and
   "upper case alpha" is in the same position as "lower case a" in the
   US-ASCII table.

   These normalized subset character sets can be thought of as "local
   character sets", suitable for an operating system locale.

   Local character sets are not suitable for the NFS protocol.  Consider
   someone who creates a file with a name in a Swedish character set.
   If someone else later goes to access the file with their locale set
   to the Swedish language, then there are no problems.  But if someone
   in say the US-ASCII locale goes to access the file, the file name
   will look very different, because the Swedish characters in the 7 bit
   table will now be represented in US-ASCII characters on the display.
   It would be preferable to give the US-ASCII user a way to display the
   file name using Swedish glyphs. In order to do that, the NFS protocol
   would have to include the locale with the file name on each operation
   to create a file.

   But then what of the situation when there is a path name on the
   server like:

         /component-1/component-2/component-3

   Each component could have been created with a different locale.  If
   one issues CREATE with multi-component path name, and if some of the
   leading components already exist, what is to be done with the
   existing components?  Is the current locale attribute replaced with
   the user's current one?  These types of situations quickly become too
   complex when there is an alternate solution.

   If the NFS version 4 protocol used a universal 16 bit or 32 bit
   character set (or an encoding of a 16 bit or 32 bit character set
   into octets), then the server and client need not care if the locale
   of the user accessing the file is different than the locale of the
   user who created the file.  The unique 16 bit or 32 bit encoding of
   the character allows for determination of what language the character
   is from and also how to display that character on the client.  The
   server need not know what locales are used.
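
The Swedish example from the quoted section can be sketched in a few lines of Python. The mapping table below is written out by hand for illustration, modelled on the Swedish 7-bit national variant of ISO 646; the helper name and the file name are assumptions, not anything taken from the RFC.

```python
# Hand-written illustration of a 7-bit "local character set":
# Swedish letters occupy code positions that US-ASCII uses for
# brackets and braces (modelled on the Swedish ISO 646 variant).
SWEDISH_7BIT = {
    "Ä": 0x5B, "Ö": 0x5C, "Å": 0x5D,
    "ä": 0x7B, "ö": 0x7C, "å": 0x7D,
}

def encode_swedish_7bit(name: str) -> bytes:
    """Encode a name in the illustrative Swedish 7-bit set (hypothetical helper)."""
    out = bytearray()
    for ch in name:
        if ch in SWEDISH_7BIT:
            out.append(SWEDISH_7BIT[ch])
        elif ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            raise ValueError(f"{ch!r} is not in the 7-bit Swedish set")
    return bytes(out)

name = "smörgåsbord.txt"

# Local character set: the bytes only make sense with the right locale.
local_bytes = encode_swedish_7bit(name)
seen_by_ascii_user = local_bytes.decode("ascii")
print(seen_by_ascii_user)   # sm|rg}sbord.txt

# Universal character set in UTF-8: no locale is needed to interpret it.
utf8_bytes = name.encode("utf-8")
seen_by_any_user = utf8_bytes.decode("utf-8")
print(seen_by_any_user)     # smörgåsbord.txt
```

With the local set, the server would have to remember which table each name was created under; with UTF-8, the bytes carry that information themselves.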

They also encode the owner and owner_group in UTF-8, not as integers. I
strongly suggest using only the ASCII subset of UTF-8 here for the
foreseeable future. We *REALLY* do not want to get into locale-dependent
user names, given the hopefully obvious security implications.
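
A minimal Python sketch of such a restriction follows. The function name and the decision to allow only printable, non-space ASCII are my own assumptions for illustration; nothing in the RFC prescribes this check.

```python
def is_ascii_only_owner(raw: bytes) -> bool:
    """Accept an owner/owner_group string only if it is non-empty
    printable ASCII (0x21..0x7E). Hypothetical policy check."""
    return len(raw) > 0 and all(0x21 <= b <= 0x7E for b in raw)

# A plain ASCII owner string passes.
print(is_ascii_only_owner(b"mgk25@cl.cam.ac.uk"))             # True

# A name with a non-ASCII letter is rejected.
print(is_ascii_only_owner("jürgen@example.org".encode("utf-8")))   # False

# So is a look-alike: CYRILLIC SMALL LETTER A (U+0430) instead of "a".
spoofed = "\u0430dmin@example.org".encode("utf-8")
print(is_ascii_only_owner(spoofed))                            # False
```

The last case is the kind of problem I have in mind: the spoofed name renders almost identically to "admin@example.org" but is a different string.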

http://www.ietf.org/rfc/rfc3010.txt

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
