Bruno Haible wrote on 2002-02-20 12:01 UTC:
> Where is the conversion between the NFS filenames and the user visible
> filenames (in locale encoding) to take place? Probably in the kernel,
> and the user-visible encoding will be given by a mount option?
Judging by the way they phrased it, I think they were primarily assuming
that Unix file names should always be in UTF-8, hence the word "transparent".
The general file system model used for the NFS version 4 protocol is
the same as previous versions. The server file system is
hierarchical with the regular files contained within being treated as
opaque byte streams. In a slight departure, file and directory names
are encoded with UTF-8 to deal with the basics of
internationalization.
Conversion can be lossy or awkward and thus an endless can of worms.
Let's not start that please.
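To illustrate the lossy case, here is a minimal sketch. The file name is made up, and ISO 8859-1 merely stands in for whatever legacy locale encoding a client might use:

```python
# Hypothetical file name containing one character that ISO 8859-1 can
# represent (the i with diaeresis) and one that it cannot (Greek capital Omega).
name = "na\u00efve-\u03a9mega.txt"

# UTF-8 can encode every character, so this round trip is exact.
utf8_round_trip = name.encode("utf-8").decode("utf-8")

# Converting to the legacy encoding has to substitute something for the
# Omega; with errors="replace" Python substitutes "?".
latin1_bytes = name.encode("iso-8859-1", errors="replace")
latin1_round_trip = latin1_bytes.decode("iso-8859-1")

print(utf8_round_trip == name)    # the UTF-8 round trip preserves the name
print(latin1_round_trip == name)  # the legacy round trip does not
```

Once the substitution has happened, the original name cannot be recovered from the converted bytes, and two distinct names can collapse into the same converted name.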
11. Internationalization
The primary issue in which NFS needs to deal with
internationalization, or I18n, is with respect to file names and
other strings as used within the protocol. The choice of string
representation must allow reasonable name/string access to clients
which use various languages. The UTF-8 encoding of the UCS as
defined by [ISO10646] allows for this type of access and follows the
policy described in "IETF Policy on Character Sets and Languages",
[RFC2277]. This choice is explained further in the following.
11.1. Universal Versus Local Character Sets
[RFC1345] describes a table of 16 bit characters for many different
languages (the bit encodings match Unicode, though of course RFC1345
is somewhat out of date with respect to current Unicode assignments).
Each character from each language has a unique 16 bit value in the 16
bit character set. Thus this table can be thought of as a universal
character set. [RFC1345] then talks about groupings of subsets of
the entire 16 bit character set into "Charset Tables". For example
one might take all the Greek characters from the 16 bit table (which
are consecutively allocated), and normalize their offsets to a table
that fits in 7 bits. Thus it is determined that "lower case alpha"
is in the same position as "upper case a" in the US-ASCII table, and
"upper case alpha" is in the same position as "lower case a" in the
US-ASCII table.
These normalized subset character sets can be thought of as "local
character sets", suitable for an operating system locale.
Local character sets are not suitable for the NFS protocol. Consider
someone who creates a file with a name in a Swedish character set.
If someone else later goes to access the file with their locale set
to the Swedish language, then there are no problems. But if someone
in say the US-ASCII locale goes to access the file, the file name
will look very different, because the Swedish characters in the 7 bit
table will now be represented in US-ASCII characters on the display.
It would be preferable to give the US-ASCII user a way to display the
file name using Swedish glyphs. In order to do that, the NFS protocol
would have to include the locale with the file name on each operation
to create a file.
But then what of the situation when there is a path name on the
server like:
/component-1/component-2/component-3
Each component could have been created with a different locale. If
one issues CREATE with multi-component path name, and if some of the
leading components already exist, what is to be done with the
existing components? Is the current locale attribute replaced with
the user's current one? These types of situations quickly become too
complex when there is an alternate solution.
If the NFS version 4 protocol used a universal 16 bit or 32 bit
character set (or an encoding of a 16 bit or 32 bit character set
into octets), then the server and client need not care if the locale
of the user accessing the file is different than the locale of the
user who created the file. The unique 16 bit or 32 bit encoding of
the character allows for determination of what language the character
is from and also how to display that character on the client. The
server need not know what locales are used.
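The argument quoted above can be sketched roughly as follows. This is only an illustration: the table is a simplified, assumed subset of the Swedish 7-bit national variant of ISO 646, and the helper name and file name are hypothetical.

```python
import unicodedata

# Simplified, assumed subset of the Swedish 7-bit national character set:
# six US-ASCII punctuation positions are reused for Swedish letters.
SWEDISH_7BIT = {0x5B: "Ä", 0x5C: "Ö", 0x5D: "Å",
                0x7B: "ä", 0x7C: "ö", 0x7D: "å"}

def decode_7bit(raw, table):
    """Decode 7-bit bytes, overriding ASCII where the local table differs."""
    return "".join(table.get(byte, chr(byte)) for byte in raw)

# The same bytes on the wire, seen through two different locales.
raw_name = b"sm|rg}s.txt"
swedish_view = decode_7bit(raw_name, SWEDISH_7BIT)  # what the creator sees
ascii_view = raw_name.decode("ascii")               # what a US-ASCII user sees

# With UTF-8 the bytes decode to the same characters for everyone,
# and each character identifies itself independently of any locale.
utf8_name = "smörgås.txt".encode("utf-8")
universal_view = utf8_name.decode("utf-8")
char_names = [unicodedata.name(ch) for ch in "öå"]

print(swedish_view, ascii_view, universal_view)
print(char_names)
```

With the local table, the meaning of the bytes depends on information that is not stored with the name; with UTF-8 no such side information is needed.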
They also encode the owner and owner_group in UTF-8, not as integers. I
strongly suggest using only the ASCII subset of UTF-8 here for the
foreseeable future. We *REALLY* do not want to get into locale-dependent
user names, given all the hopefully obvious security implications.
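As a minimal sketch of what such a restriction could look like (the function name, the example principals, and the user@domain form used here are assumptions for illustration, not something the draft mandates as a check):

```python
def is_ascii_principal(name):
    """Return True if the owner/owner_group string uses only ASCII characters."""
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

genuine = "admin@example.org"
# Hypothetical look-alike: the first letter is U+0430 CYRILLIC SMALL LETTER A,
# which many fonts render identically to the Latin "a".
lookalike = "\u0430dmin@example.org"

print(is_ascii_principal(genuine))    # accepted
print(is_ascii_principal(lookalike))  # rejected
```

The two strings are different principals although they may be indistinguishable on screen, which is one of the security problems that an ASCII-only policy avoids.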
http://www.ietf.org/rfc/rfc3010.txt
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/