Hi Jeffrey!

On 7 May 2008, at 00:49, Jeffrey Altman wrote:
And it is also not a filesystem issue. I agree that there is a problem, but
I think we differ concerning the level on which it should be solved.

> [EMAIL PROTECTED] wrote:
>> This problem is nothing unicode-specific, the users can easily create
>> file names even in plain ascii which are visually indistinguishable.
>> (easiest with certain fonts :) As soon as application software can list
>> files and let the user pick one, it is no longer a remarkable problem in
>> practice.
> This is not true since the user interfaces on each of the operating
> systems will all represent the strings to the user as the same name.
> This is not a font issue.
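For illustration (the names below are made up): even two plain-ASCII names
that differ only in lowercase 'l' versus uppercase 'I' are distinct byte
sequences, yet render identically in many fonts. A tiny Python sketch:

    # Two different ASCII file names that many fonts render identically:
    # lowercase 'l' (ell) vs. uppercase 'I' (i).
    a = "invoice_Il.txt"
    b = "invoice_lI.txt"
    print(a == b)        # False - these would be two distinct files
    print(a)             # ...yet in many fonts the two lines below
    print(b)             # look exactly the same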
I beg to differ: the representation of the file name will differ according
to where the file was created, but accesses afterwards _must_ work
nevertheless. Each system can read the correct representation from the
directory to be able to open the file.

>>> (2) Since the directory lookups are performed using a hash table, a
>>> file with the name being searched for might exist but it cannot be
>>> found because the input to the hash function on client B is different
>>> than the input used to create the entry on client A.
>> If the name is a byte sequence, this can not happen; you imply that the
>> file name _is_ a character string.
> A file name from the perspective of the user is a character string. The
> user types in a name via the user interface, and the user interface
> determines how to represent that name, not the user. If the user enters
> the name on a MacOS X system she will get a UNICODE sequence that is in
> decomposed form. If the user enters the same name on Windows she will get
> a UNICODE sequence that is in composed form. If the user tries to access
> her files from both machines she will have interop problems.
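To make the composed/decomposed mismatch (and the hash-lookup failure in
point (2)) concrete, here is a small Python sketch; the name and the hash
function are purely illustrative, not the directory hash any particular
filesystem actually uses:

    import hashlib
    import unicodedata

    composed   = "caf\u00e9"    # 'é' as one precomposed code point (NFC, Windows-style input)
    decomposed = "cafe\u0301"   # 'e' + combining acute accent (NFD, MacOS X-style input)

    print(composed == decomposed)       # False, though both display as "café"
    print(composed.encode("utf-8"))     # b'caf\xc3\xa9'
    print(decomposed.encode("utf-8"))   # b'cafe\xcc\x81'

    # An illustrative directory lookup that hashes the raw name bytes:
    def dir_hash(name_bytes):
        return hashlib.md5(name_bytes).hexdigest()

    # The entry created by client A (decomposed) is not found by client B (composed):
    print(dir_hash(decomposed.encode("utf-8")) == dir_hash(composed.encode("utf-8")))  # False

    # Normalizing both sides to the same form would make them compare equal:
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True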
Well, there are broken operating systems as well as broken applications.
Let's not compound that with broken filesystems.

>> (Of course, applications do read user input as text - to create new
>> files, but most often not for opening existing files.) Compatibility in
>> file naming (what is saved on one occasion should be readable on
>> another, possibly on another computer and by another program) belongs at
>> the application level. File naming compatibility does not differ
>> essentially from compatibility of file contents.
> We already have evidence to the contrary.
How do you know you're dealing with Unicode in the first place? Imagine a
latin1 file name which incidentally does not violate the UTF-8 rules, but
happens to be not normalized. Normalizing it will simply destroy it.

>>> Storing file names as opaque octet sequences is broken in other ways.
>>> Depending on the character set used on the client, the file name might
>>> or might not be representable, since the octet sequence contains no
>>> indication whether the sequence is CP437, CP850, CP1252, ISO Latin-1,
>>> ISO Latin-9, UTF-7, UTF-8, etc.
>> This is just the result of broken practices - using limited and thus
>> incompatible encodings ultimately leads to breakage, and no efforts can
>> eliminate the pain afterwards.
> Correct. But with Unicode we do have the ability to eliminate the
> problems associated with (a) no normalization; (b) decomposed
> normalization; and (c) composed normalization.
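A contrived but concrete instance of the latin1 case above, sketched in
Python: the bytes are a legal ISO Latin-1 name, also happen to be
well-formed (decomposed) UTF-8, and are rewritten by NFC normalization,
after which the original byte sequence no longer matches:

    import unicodedata

    # Raw name bytes as a Latin-1 client might have stored them
    # (contrived, but every byte is legal in ISO Latin-1).
    original = b"Cafe\xcc\x81"

    # The same bytes are also well-formed UTF-8 - and decode to a
    # *decomposed* string: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    as_utf8 = original.decode("utf-8")            # 'Café' in NFD

    # A server that normalizes names to composed form rewrites the bytes:
    normalized = unicodedata.normalize("NFC", as_utf8).encode("utf-8")

    print(original)                      # b'Cafe\xcc\x81'
    print(normalized)                    # b'Caf\xc3\xa9'
    print(original == normalized)        # False - a lookup with the original bytes now fails
    print(normalized.decode("latin-1"))  # 'CafÃ©' - garbage from the Latin-1 client's view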
>> The same file can be opened by two processes running with different
>> locales, on the same computer and even at the same time. There is hardly
>> any information about file name encoding in an open() system call. How
>> does the file system know which encoding is used by a particular process
>> for a particular open()?
> There is no knowledge at the open() or CreateFile() level. There is
> extensive knowledge at the user interface level.
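The open() point can be illustrated directly: the name handed to the kernel
is only a byte sequence, and nothing in it (or in the call) says which
encoding the calling process had in mind. A short Python sketch with a
made-up name:

    # One byte sequence - exactly what open(2) would receive as the name.
    raw = b"B\xe4ume.txt"

    # Which characters these bytes "are" depends entirely on the process
    # interpreting them; the byte sequence itself carries no encoding tag:
    print(raw.decode("latin-1"))   # 'Bäume.txt'  - an ISO Latin-1 user's view
    print(raw.decode("cp437"))     # 'BΣume.txt'  - a CP437 user's view
    try:
        raw.decode("utf-8")        # not even valid UTF-8
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)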
Exactly. So that is the place where this problem is to be solved.
Ciao,
Roland
--
Any society that would give up a little liberty to gain a little
security will deserve neither and lose both. - Benjamin Franklin
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P+++ L+++ E(+) W+ !N K- w--- M+ !V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e++++ h---- y+++
------END GEEK CODE BLOCK------
