--On Sunday, July 20, 2008 12:09:18 AM -0400 Jeffrey Altman <[EMAIL PROTECTED]> wrote:

> Mattias Pantzare wrote:
>> 2008/7/19 Jeffrey Altman <[EMAIL PROTECTED]>:
>>> The Windows client code is correct.  The question is how we are going
>>> to deal with this stuff for platforms where the process locale is not
>>> guaranteed to be UTF-8.  We need to figure out how ZFS, which does
>>> Unicode normalization, is handling this.
>>
>> If you tell ZFS to do Unicode normalization on a filesystem, you have
>> to use UTF-8.
>>
>> Search for normalization on this page:
>> http://docs.sun.com/app/docs/doc/819-2240/zfs-1m?a=view
>
> After speaking with one of the relevant developers from Sun, the NFS and
> CIFS file servers will enforce the use of UTF-8 as well if the data set
> has been tagged to be Unicode.
>
> We might be able to do something similar with volumes or directories
> that are tagged to be Unicode only.

Actually, I think there is a fairly simple behavior we can use that will do something useful, based on a previous discussion (possibly with the same Sun developer) about ZFS....



As you're doing now with Windows, when creating a file, use exactly what was passed in from the upper layer. This might be UTF-8, in some arbitrary normalization, or it might be something else.

When looking up existing names, prefer an exact octet-wise match, as you're doing now with Windows. This allows disambiguation of multiple differently-normalized UTF-8 names, and also allows lookup of filenames in other 8-bit charsets, provided the application and OS give you the name exactly as it appears. (That is not as unlikely as it sounds; often the name you're given will be one that was selected from a GUI display of names you provided, and so will be exactly correct even if nothing knows what charset was really in use.)

If the exact lookup fails, but the requested name is valid UTF-8, try a normalized lookup. Of course, you can only compare the name against directory entries that are also valid UTF-8; other entries will fail.
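The two-phase lookup above might be sketched roughly as follows (a hypothetical illustration, not actual client code; the function and variable names are made up, and NFC is assumed as the normalization form for the comparison):

```python
import unicodedata

def lookup(requested: bytes, entries: list[bytes]):
    """Two-phase name lookup: exact octets first, then a
    normalization-insensitive match for valid-UTF-8 names."""
    # Phase 1: exact octet-wise match.  Works for any charset, as long
    # as the caller hands us the name byte-for-byte as it was stored.
    for entry in entries:
        if entry == requested:
            return entry

    # Phase 2: only attempted when the requested name is valid UTF-8.
    try:
        want = unicodedata.normalize("NFC", requested.decode("utf-8"))
    except UnicodeDecodeError:
        return None  # not UTF-8; nothing more we can do

    for entry in entries:
        try:
            have = unicodedata.normalize("NFC", entry.decode("utf-8"))
        except UnicodeDecodeError:
            continue  # non-UTF-8 entries can never match in this phase
        if have == want:
            return entry
    return None
```

For example, a request for b"cafe\xcc\x81" ("café" in decomposed UTF-8) would fail the exact match against a directory entry b"caf\xc3\xa9" (precomposed UTF-8), but succeed in the normalized phase; a request for b"caf\xe9" (ISO-8859-1) would fail both phases.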


If you do this, you get these properties...

- When ASCII is used, everything will Just Work(tm)
- When UTF-8 is used, everything will Just Work(tm)
- When legacy 8-bit charsets are used, things will always work if everyone
 agrees on the charset in use, and will often work well enough even if
 not. This is no worse than the situation today.

What you do _not_ get is the ability to pass in a UTF-8 filename and have a lookup succeed when the filename is actually represented in a legacy charset, or vice versa. This essentially means that transition from a legacy 8-bit character set to UTF-8 will be painful.

In practice, I think we can ease this pain by providing mechanisms to allow server admins, client admins, client users, and/or content owners to advertise a legacy charset that is in use, probably at the volume, server, or cell level. This information can be used by clients to convert between UTF-8 and the advertised legacy charset for the purpose of doing lookups. Of course, even in this case, new names should always be stored exactly as given, without conversion.
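A client's use of such an advertised charset might look something like this (again purely illustrative; ISO-8859-1 is assumed as the advertised charset, and the names are invented for the sketch):

```python
def charset_fallback_lookup(requested: bytes, entries: list[bytes],
                            advertised: str = "iso-8859-1"):
    """When the volume/server/cell advertises a legacy charset, also
    try the request converted between UTF-8 and that charset.  New
    names would still be stored exactly as given, never converted."""
    candidates = [requested]
    try:
        # UTF-8 request, legacy-charset name on disk.
        candidates.append(requested.decode("utf-8").encode(advertised))
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass
    try:
        # Legacy-charset request, UTF-8 name on disk.
        candidates.append(requested.decode(advertised).encode("utf-8"))
    except UnicodeDecodeError:
        pass
    for cand in candidates:
        if cand in entries:
            return cand
    return None
```

So a request for b"caf\xc3\xa9" (UTF-8) could find an entry stored as b"caf\xe9" (ISO-8859-1), and vice versa, without either name ever being rewritten on the server.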


-- Jeff
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
