Dave Botsch wrote:
(1) The same text in Unicode can be represented by different sequences of characters. As a result, you could have client A and client B both create a file with the same name that cannot be visually distinguished by the end user. Now which one do you open?

If everyone is using UTF-8 encoding, does the above problem still exist? And by UTF-8, I mean the "real" UTF-8 encoding, not one of the many variants (which would mean that the OpenAFS client would have to do some translation)?
UTF-8 is simply an encoding of Unicode. The problem is with Unicode itself. I suggest you read the Unicode specification, and in particular Annex #15, Unicode Normalization Forms: http://unicode.org/reports/tr15/
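To make the normalization point concrete, here is a minimal Java sketch (Java only because its standard library ships a normalizer; any Unicode-aware language would show the same thing): two different code-point sequences render as the same "café", compare unequal, and only compare equal after both are normalized per UAX #15.

    import java.text.Normalizer;

    public class NormalizationDemo {
        public static void main(String[] args) {
            // The same visual name written two ways:
            // precomposed U+00E9 vs. plain 'e' followed by combining acute U+0301
            String composed   = "caf\u00E9";
            String decomposed = "cafe\u0301";

            System.out.println(composed.equals(decomposed));    // false: different code points
            System.out.println(composed.length());              // 4
            System.out.println(decomposed.length());            // 5

            // After normalizing both to NFC (UAX #15) they compare equal
            String a = Normalizer.normalize(composed,   Normalizer.Form.NFC);
            String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(a.equals(b));                     // true
        }
    }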
(2) Since the directory lookups are performed using a hash table, a file with the name being searched for might exist but it cannot be found because the input to the hash function on client B is different than the input used to create the entry on client A.

It would be different because of #1 above?
No.
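Put differently: the lookup misses whenever the octets handed to the directory hash differ, whatever the reason they differ (normalization form, client charset, ...). A small Java sketch of the effect, with java.util.Arrays.hashCode standing in for whatever hash the directory format actually uses (that substitution is purely illustrative):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class HashInputDemo {
        public static void main(String[] args) {
            // Client A created "café" with a Latin-1 locale; client B looks it up as UTF-8
            byte[] createdByA  = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1); // 63 61 66 E9
            byte[] lookedUpByB = "caf\u00E9".getBytes(StandardCharsets.UTF_8);      // 63 61 66 C3 A9

            // Different octet sequences, therefore different hash inputs: the lookup fails
            System.out.println(Arrays.equals(createdByA, lookedUpByB));                       // false
            System.out.println(Arrays.hashCode(createdByA) == Arrays.hashCode(lookedUpByB));  // false here
        }
    }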
Storing file names as opaque octet sequences is broken in other ways. Depending on the character set used on the client, the file name might or might not be representable, since the octet sequence contains no indication whether the sequence is CP437, CP850, CP1252, ISO Latin-1, ISO Latin-9, UTF-7, UTF-8, etc.

So, if we know what sequence we're using...? How do local filesystems handle this? I might very well create a file on an ext3-formatted USB key with my locale set to ISO Latin-1, then try to access it from another box with the charset set to UTF-16 (or something completely different). Or maybe I named the file using some non-Arabic character set?
They do not handle it.
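A Java sketch of what "they do not handle it" means in practice: the four octets a Latin-1 client stores for the name "café" carry no charset tag, so any other client can only guess how to decode them (the charsets below are just the guaranteed-available ones, chosen for illustration):

    import java.nio.charset.StandardCharsets;

    public class OpaqueOctetsDemo {
        public static void main(String[] args) {
            // Octets a Latin-1 client would store for the name "café": 63 61 66 E9
            byte[] name = {0x63, 0x61, 0x66, (byte) 0xE9};

            // Nothing in the octets says which charset produced them
            System.out.println(new String(name, StandardCharsets.ISO_8859_1)); // café
            System.out.println(new String(name, StandardCharsets.UTF_8));      // caf + U+FFFD: 0xE9 is malformed UTF-8
            System.out.println(new String(name, StandardCharsets.UTF_16BE));   // two unrelated CJK characters
        }
    }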
I've read that Java uses a "modified UTF-8" which can produce 6 instead of 4 octets per character... how does this not prevent other applications from accessing the files on the local filesystem?
There is nothing "modified" about a UTF-8 encoding that requires up to 6 octets; the original UTF-8 definition already permitted sequences of up to 6 octets per character.
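For what it's worth, what Java's "modified UTF-8" actually changes is the representation of NUL and of supplementary characters, and as far as I know it only appears in DataOutputStream.writeUTF/readUTF and JNI strings, not in file names handed to the operating system through the normal file APIs. A Java sketch, using U+1F600 as an arbitrary example of a supplementary character:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            // One supplementary (non-BMP) character, held in Java as a surrogate pair
            String s = new String(Character.toChars(0x1F600));

            // Standard UTF-8: a single 4-octet sequence
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);   // 4

            // Modified UTF-8 (writeUTF): each surrogate is encoded separately, 3 + 3 = 6 octets
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            System.out.println(buf.size() - 2);                              // 6, after the 2-octet length prefix
        }
    }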