--On Sunday, July 20, 2008 12:09:18 AM -0400 Jeffrey Altman <[EMAIL PROTECTED]> wrote:

> Mattias Pantzare wrote:
>> 2008/7/19 Jeffrey Altman <[EMAIL PROTECTED]>:
>>> The Windows client code is correct.  The question is how we are going
>>> to deal with this stuff for platforms where the process locale is not
>>> guaranteed to be UTF-8.  We need to figure out how ZFS, which does
>>> Unicode normalization, is handling this.
>>
>> If you tell ZFS to do Unicode normalization on a filesystem, you have
>> to use UTF-8.
>>
>> Search for normalization on this page:
>> http://docs.sun.com/app/docs/doc/819-2240/zfs-1m?a=view
>
> After speaking with one of the relevant developers from Sun, the NFS and
> CIFS file servers will enforce the use of UTF-8 as well if the data set
> has been tagged to be Unicode.
>
> We might be able to do something similar with volumes or directories
> that are tagged to be Unicode only.

Actually, I think there is a fairly simple behavior we can use that will do something useful, based on a previous discussion (possibly with the same Sun developer) about ZFS....



As you're doing now with Windows, when creating a file, use exactly what was passed in from the upper layer. This might be UTF-8, in some arbitrary normalization, or it might be something else.

When looking up existing names, prefer an exact octet-wise match, as you're doing now with Windows. This allows disambiguation of multiple differently-normalized UTF-8 names, and also allows lookup of filenames in other 8-bit charsets, provided the application and OS give you the name exactly as it appears. (That is not as unlikely as it sounds; often the name you're given will be one that was selected from a GUI display of names you provided, and so will be exactly correct even if nothing knows what charset was really in use.)

If the exact lookup fails, but the requested name is valid UTF-8, try a normalized lookup. Of course, you can only compare the name against directory entries that are also valid UTF-8; other entries will fail.
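The two-phase lookup above might be sketched roughly as follows (a hypothetical illustration, not actual client code; the function and variable names are made up, and NFC is assumed as the normalization form for the comparison):

```python
import unicodedata

def lookup(requested: bytes, entries: list[bytes]):
    """Two-phase name lookup: exact octets first, then a
    normalization-insensitive match for valid-UTF-8 names."""
    # Phase 1: exact octet-wise match.  Works for any charset, as long
    # as the caller hands us the name byte-for-byte as it was stored.
    for entry in entries:
        if entry == requested:
            return entry

    # Phase 2: only attempted when the requested name is valid UTF-8.
    try:
        want = unicodedata.normalize("NFC", requested.decode("utf-8"))
    except UnicodeDecodeError:
        return None  # not UTF-8; nothing more we can do

    for entry in entries:
        try:
            have = unicodedata.normalize("NFC", entry.decode("utf-8"))
        except UnicodeDecodeError:
            continue  # non-UTF-8 entries can never match in this phase
        if have == want:
            return entry
    return None
```

For example, a request for b"cafe\xcc\x81" ("café" in decomposed UTF-8) would fail the exact match against a directory entry b"caf\xc3\xa9" (precomposed UTF-8), but succeed in the normalized phase; a request for b"caf\xe9" (ISO-8859-1) would fail both phases.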


If you do this, you get these properties...

- When ASCII is used, everything will Just Work(tm)
- When UTF-8 is used, everything will Just Work(tm)
- When legacy 8-bit charsets are used, things will always work if everyone
 agrees on the charset in use, and will often work well enough even if
 not. This is no worse than the situation today.

What you do _not_ get is the ability to pass in a UTF-8 filename and have a lookup succeed when the filename is actually represented in a legacy charset, or vice versa. This essentially means that transition from a legacy 8-bit character set to UTF-8 will be painful.

In practice, I think we can ease this pain by providing mechanisms to allow server admins, client admins, client users, and/or content owners to advertise a legacy charset that is in use, probably at the volume, server, or cell level. This information can be used by clients to convert between UTF-8 and the advertised legacy charset for the purpose of doing lookups. Of course, even in this case, new names should always be stored exactly as given, without conversion.
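A client's use of such an advertised charset might look something like this (again purely illustrative; ISO-8859-1 is assumed as the advertised charset, and the names are invented for the sketch):

```python
def charset_fallback_lookup(requested: bytes, entries: list[bytes],
                            advertised: str = "iso-8859-1"):
    """When the volume/server/cell advertises a legacy charset, also
    try the request converted between UTF-8 and that charset.  New
    names would still be stored exactly as given, never converted."""
    candidates = [requested]
    try:
        # UTF-8 request, legacy-charset name on disk.
        candidates.append(requested.decode("utf-8").encode(advertised))
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass
    try:
        # Legacy-charset request, UTF-8 name on disk.
        candidates.append(requested.decode(advertised).encode("utf-8"))
    except UnicodeDecodeError:
        pass
    for cand in candidates:
        if cand in entries:
            return cand
    return None
```

So a request for b"caf\xc3\xa9" (UTF-8) could find an entry stored as b"caf\xe9" (ISO-8859-1), and vice versa, without either name ever being rewritten on the server.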


-- Jeff
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
