Followup to: <[EMAIL PROTECTED]>
By author: Jungshik Shin <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> >
> > Yes. Filenames are byte sequences, period, full stop. Any attempt at
> > normalization would violate SUS/POSIX.
>
> All right. That's what the *current* SUS/POSIX says. However, that
> is hardly a solace to a user who'd be puzzled that two visually
> identical and canonically equivalent filenames are treated differently.
>
> For instance, U+00D6 (Latin Capital Letter O with diaeresis) should look
> identical and be treated identically to U+004F followed by U+0308. That's
> what users expect. I don't know what the best way to resolve this
> conflict is. It may be time to seriously reconsider this particular
> aspect of SUS/POSIX. I'm wondering how MacOS X (well, it's not 100%
> SUS/POSIX compliant, but nonetheless it's Unix) works in this area. It
> uses NFD. That is, 'U+00D6' is stored as 'U+004F U+0308' and both are
> treated identically.
>
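To make the quoted example concrete, here is a minimal Python sketch (the
helper name utf8_hex is made up for illustration) showing that the two
spellings are different byte sequences on disk, and that only an explicit
normalization step makes them compare equal:

```python
import unicodedata

precomposed = "\u00D6"   # LATIN CAPITAL LETTER O WITH DIAERESIS
decomposed = "O\u0308"   # U+004F followed by U+0308 COMBINING DIAERESIS

def utf8_hex(s):
    """Hypothetical helper: the UTF-8 bytes of s as space-separated hex."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))

# The two spellings render alike but are different byte sequences,
# so a byte-oriented filesystem treats them as two distinct names.
print(utf8_hex(precomposed))        # c3 96
print(utf8_hex(decomposed))         # 4f cc 88
print(precomposed == decomposed)    # False

# Only after normalizing both sides to the same form do they match.
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
print(nfd_equal, nfc_equal)         # True True
```

Whether that normalization step belongs in the kernel, in the filesystem, or
in applications is exactly what is in dispute here.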
There *is* no way to solve this problem. You have the same kind of
problem with U+0041 LATIN CAPITAL LETTER A versus U+0391 GREEK CAPITAL
LETTER ALPHA. However, if you attempt normalizations you *will*
introduce security holes in the system (as has been amply shown by
Windows, even though *its* normalizations are much simpler.)
The only possible answer is to make sure a decoded representation is
available to the user (ls -b or somesuch.) Attempting
canonicalization is doomed to failure, if nothing else when the next
version of Unicode comes out, and you already have files that are
encoded with a different set of normalizations. Now your files cannot
be accessed! Oops!
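
A rough Python sketch of both points, under stated assumptions: escape_name
is a made-up helper in the spirit of "ls -b" (it does not reproduce ls's
exact output), and the checks only cover the four standard normalization
forms as shipped with Python's unicodedata module.

```python
import unicodedata

def escape_name(raw: bytes) -> str:
    """Hypothetical helper: show a filename's bytes unambiguously.

    Printable ASCII (other than space and backslash) is shown as-is;
    every other byte becomes a three-digit octal escape.
    """
    out = []
    for b in raw:
        if 0x20 < b < 0x7F and b != 0x5C:
            out.append(chr(b))
        else:
            out.append(f"\\{b:03o}")
    return "".join(out)

# Two visually identical names become distinguishable once escaped.
print(escape_name("\u00D6".encode("utf-8")))   # \303\226
print(escape_name("O\u0308".encode("utf-8")))  # O\314\210

# Normalization does not help with look-alikes such as A versus ALPHA:
# no normalization form maps U+0391 onto U+0041.
forms = ("NFC", "NFD", "NFKC", "NFKD")
alpha_folds_to_a = any(
    unicodedata.normalize(form, "\u0391") == "A" for form in forms
)
print(alpha_folds_to_a)                        # False
```

The escaped form changes nothing about how the names are stored or compared;
it only lets the user see which byte sequence they are actually dealing with.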
-hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[EMAIL PROTECTED]>
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/