Hi Jeffrey!

On 7 May 2008, at 00:49, Jeffrey Altman wrote:
And it is also not a filesystem issue. I agree that there is a problem, but
I think we differ concerning the level on which it should be solved.

> [EMAIL PROTECTED] wrote:
>> This problem is nothing unicode-specific, the users can easily create
>> file names even in plain ascii which are visually indistinguishable.
>> (easiest with certain fonts :) As soon as application software can list
>> files and let the user pick one, it is no longer a remarkable problem in
>> practice.
> This is not true since the user interfaces on each of the operating
> systems will all represent the strings to the user as the same name.
> This is not a font issue.
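For illustration (the names below are made up): even two plain-ASCII names
that differ only in lowercase 'l' versus uppercase 'I' are distinct byte
sequences, yet render identically in many fonts. A tiny Python sketch:

    # Two different ASCII file names that many fonts render identically:
    # lowercase 'l' (ell) vs. uppercase 'I' (i).
    a = "invoice_Il.txt"
    b = "invoice_lI.txt"
    print(a == b)        # False - these would be two distinct files
    print(a)             # ...yet in many fonts the two lines below
    print(b)             # look exactly the same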
I beg to differ: the representation of the file name will differ according
to where the file was created, but accesses afterwards _must_ work
nevertheless. Each system can read the correct representation from the
directory to be able to open the file.

>>> (2) Since the directory lookups are performed using a hash table, a
>>> file with the name being searched for might exist but it cannot be
>>> found because the input to the hash function on client B is different
>>> than the input used to create the entry on client A.
>> If the name is a byte sequence, this can not happen; you imply that the
>> file name _is_ a character string.
> A file name from the perspective of the user is a character string. The
> user types in a name via the user interface, and the user interface
> determines how to represent that name, not the user. If the user enters
> the name on a MacOS X system she will get a UNICODE sequence that is in
> decomposed form. If the user enters the same name on Windows she will get
> a UNICODE sequence that is in composed form. If the user tries to access
> her files from both machines she will have interop problems.
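To make the composed/decomposed mismatch (and the hash-lookup failure in
point (2)) concrete, here is a small Python sketch; the name and the hash
function are purely illustrative, not the directory hash any particular
filesystem actually uses:

    import hashlib
    import unicodedata

    composed   = "caf\u00e9"    # 'é' as one precomposed code point (NFC, Windows-style input)
    decomposed = "cafe\u0301"   # 'e' + combining acute accent (NFD, MacOS X-style input)

    print(composed == decomposed)       # False, though both display as "café"
    print(composed.encode("utf-8"))     # b'caf\xc3\xa9'
    print(decomposed.encode("utf-8"))   # b'cafe\xcc\x81'

    # An illustrative directory lookup that hashes the raw name bytes:
    def dir_hash(name_bytes):
        return hashlib.md5(name_bytes).hexdigest()

    # The entry created by client A (decomposed) is not found by client B (composed):
    print(dir_hash(decomposed.encode("utf-8")) == dir_hash(composed.encode("utf-8")))  # False

    # Normalizing both sides to the same form would make them compare equal:
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True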
Well, there are broken operating systems as well as broken applications.
Let's not compound that with broken filesystems.

>> (Of course, applications do read user input as text - to create new
>> files, but most often not for opening existing files.) Compatibility in
>> file naming (what is saved on one occasion should be readable on
>> another, possibly on another computer and by another program) belongs at
>> the application level. File naming compatibility does not differ
>> essentially from compatibility of file contents.
> We already have evidence to the contrary.
How do you know you're dealing with Unicode in the first place? Imagine a
latin1 file name which incidentally does not violate the UTF-8 rules, but
happens to be not normalized. Normalizing it will simply destroy it.

>>> Storing file names as opaque octet sequences is broken in other ways.
>>> Depending on the character set used on the client, the file name might
>>> or might not be representable, since the octet sequence contains no
>>> indication whether the sequence is CP437, CP850, CP1252, ISO Latin-1,
>>> ISO Latin-9, UTF-7, UTF-8, etc.
>> This is just the result of broken practices - using limited and thus
>> incompatible encodings ultimately leads to breakage, and no efforts can
>> eliminate the pain afterwards.
> Correct. But with Unicode we do have the ability to eliminate the
> problems associated with (a) no normalization; (b) decomposed
> normalization; and (c) composed normalization.
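A contrived but concrete instance of the latin1 case above, sketched in
Python: the bytes are a legal ISO Latin-1 name, also happen to be
well-formed (decomposed) UTF-8, and are rewritten by NFC normalization,
after which the original byte sequence no longer matches:

    import unicodedata

    # Raw name bytes as a Latin-1 client might have stored them
    # (contrived, but every byte is legal in ISO Latin-1).
    original = b"Cafe\xcc\x81"

    # The same bytes are also well-formed UTF-8 - and decode to a
    # *decomposed* string: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    as_utf8 = original.decode("utf-8")            # 'Café' in NFD

    # A server that normalizes names to composed form rewrites the bytes:
    normalized = unicodedata.normalize("NFC", as_utf8).encode("utf-8")

    print(original)                      # b'Cafe\xcc\x81'
    print(normalized)                    # b'Caf\xc3\xa9'
    print(original == normalized)        # False - a lookup with the original bytes now fails
    print(normalized.decode("latin-1"))  # 'CafÃ©' - garbage from the Latin-1 client's view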
>> The same file can be opened by two processes running with different
>> locales, on the same computer and even at the same time. There is hardly
>> any information about file name encoding in an open() system call. How
>> does the file system know which encoding is used by a particular process
>> for a particular open()?
> There is no knowledge at the open() or CreateFile() level. There is
> extensive knowledge at the user interface level.
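The open() point can be illustrated directly: the name handed to the kernel
is only a byte sequence, and nothing in it (or in the call) says which
encoding the calling process had in mind. A short Python sketch with a
made-up name:

    # One byte sequence - exactly what open(2) would receive as the name.
    raw = b"B\xe4ume.txt"

    # Which characters these bytes "are" depends entirely on the process
    # interpreting them; the byte sequence itself carries no encoding tag:
    print(raw.decode("latin-1"))   # 'Bäume.txt'  - an ISO Latin-1 user's view
    print(raw.decode("cp437"))     # 'BΣume.txt'  - a CP437 user's view
    try:
        raw.decode("utf-8")        # not even valid UTF-8
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)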
Exactly. So that is the place where this problem is to be solved.
Ciao,
Roland
--
Any society that would give up a little liberty to gain a little
security will deserve neither and lose both. - Benjamin Franklin
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P+++ L+++ E(+) W+ !N K- w--- M+ !V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e++++ h---- y+++
------END GEEK CODE BLOCK------
