Re: filename and normalization (was gcc identifiers)

H. Peter Anvin Wed, 04 Dec 2002 08:33:23 -0800

Followup to:  <[EMAIL PROTECTED]>
By author:    Jungshik Shin <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> >
> > Yes.  Filenames are byte sequences, period, full stop.  Any attempt at
> > normalization would violate SUS/POSIX.
> 
>   All right. That's what the *current* SUS/POSIX says. However, that
> is hardly a solace to a user who'd be puzzled that two visually
> identical and canonically equivalent filenames are treated differently.
> 
> For instance, U+00D6 (Latin Capital Letter O with diaeresis) should look
> identical and be treated identically with U+004F followed by U+0308. That's
> what users expect.  I don't know what's the best way to resolve
> this conflict. It may be time to consider seriously this particular
> aspect of SUS/POSIX.  I'm wondering how MacOS X (well, it's not 100%
> SUS/POSIX compliant, but nonetheless it's Unix) works in this area. It
> uses NFD. That is, 'U+00D6' is stored as 'U+004F U+0308' and both are
> treated identically.
>


There *is* no way to solve this problem.  You have the same kind of
problem with U+0041 LATIN CAPITAL LETTER A versus U+0391 GREEK CAPITAL
LETTER ALPHA.  However, if you attempt normalizations you *will*
introduce security holes in the system (as has been amply shown by
Windows, even though *its* normalizations are much simpler.)
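
To make the two cases in this exchange concrete, here is a minimal
Python sketch.  It uses the standard unicodedata module, which is an
illustration choice and not something from the thread.  It shows that
the two spellings of O-with-diaeresis differ as byte sequences yet
compare equal after NFC or NFD, while none of the four Unicode
normalization forms makes Latin A and Greek Alpha equal.

```python
import unicodedata

precomposed = "\u00D6"    # LATIN CAPITAL LETTER O WITH DIAERESIS
decomposed = "O\u0308"    # U+004F followed by U+0308 COMBINING DIAERESIS

# What the filesystem sees: two different byte sequences.
precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
differ_as_bytes = precomposed_bytes != decomposed_bytes

# Canonical normalization maps each spelling onto the other.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

# Visually confusable, but not equivalent under any normalization form.
latin_a = "\u0041"        # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"    # GREEK CAPITAL LETTER ALPHA
forms = ("NFC", "NFD", "NFKC", "NFKD")
alpha_ever_equal = any(
    unicodedata.normalize(form, latin_a)
    == unicodedata.normalize(form, greek_alpha)
    for form in forms
)

print(differ_as_bytes, same_after_nfc, same_after_nfd, alpha_ever_equal)
```

So normalization would address the first case only; the A versus Alpha
confusion remains whichever form is chosen.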

The only possible answer is to make sure a decoded representation is
available to the user (ls -b or somesuch.)  Attempting
canonicalization is doomed to failure, if nothing else when the next
version of Unicode comes out, and you already have files that are
encoded with a different set of normalizations.  Now your files cannot
be accessed!  Oops!
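
A rough sketch of what such a "decoded representation" could look
like, in Python.  The helper name escape_name and the choice of octal
escapes are assumptions made for illustration; this is not the exact
output format of ls -b.  The idea is only that every byte outside
printable ASCII is shown explicitly, so two names that render
identically on screen can still be told apart.

```python
def escape_name(name: bytes) -> str:
    """Render a filename byte sequence with non-printable and non-ASCII
    bytes (and the backslash itself) as three-digit octal escapes.

    Hypothetical helper for illustration only.
    """
    out = []
    for byte in name:
        if 0x21 <= byte <= 0x7E and byte != 0x5C:
            out.append(chr(byte))        # printable ASCII, shown as-is
        else:
            out.append("\\%03o" % byte)  # everything else, escaped
    return "".join(out)


# The two visually identical names from the discussion, as UTF-8 bytes.
precomposed_name = "\u00D6.txt".encode("utf-8")
decomposed_name = "O\u0308.txt".encode("utf-8")

print(escape_name(precomposed_name))
print(escape_name(decomposed_name))
```

On a real directory the byte-level names can be obtained with
os.listdir(b"."), which returns bytes when given a bytes path, so no
decoding or normalization happens before display.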

        -hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt    <[EMAIL PROTECTED]>
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
