On 4 Dec 2002, H. Peter Anvin wrote:

> By author:    Jungshik Shin <[EMAIL PROTECTED]>

> >   All right. That's what the *current* SUS/POSIX says. However, that
> > is hardly a solace to a user who'd be puzzled that two visually
> > identical and canonically equivalent filenames are treated differently.

> There *is* no way to solve this problem.  You have the same kind of
> problem with U+0041 LATIN CAPITAL LETTER A versus U+0391 GREEK CAPITAL
> LETTER ALPHA.  However, if you attempt normalizations you *will*

  U+0041, U+0391, and U+0410 are NOT equivalent in any Unicode
normalization form; they're not even equivalent under NFK*. Note that I
didn't say just visually (almost) identical but qualified it with
'canonically equivalent'.
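A minimal check with Python's unicodedata module (my illustration, not part of the original discussion) confirms this: the Latin, Greek, and Cyrillic capital A's stay distinct under every normalization form.

```python
import unicodedata

# U+0041 LATIN CAPITAL LETTER A, U+0391 GREEK CAPITAL LETTER ALPHA,
# U+0410 CYRILLIC CAPITAL LETTER A: visually near-identical, but no
# normalization form maps any of them onto another.
chars = ['\u0041', '\u0391', '\u0410']
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    normalized = [unicodedata.normalize(form, c) for c in chars]
    assert len(set(normalized)) == 3  # all three remain distinct
```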

> introduce security holes in the system (as have been amply shown by
> Windows, even though *its* normalizations are even much simpler.)

  Therefore, your example cannot be used to show that there's a security
hole (unless you're talking about applying a normalization not specified
in Unicode), although it can be used to demonstrate that even after
normalization there could still be user confusion, because some visually
(almost) identical characters would still be treated differently.

  A better example for your case would be U+00C5 (LATIN CAPITAL LETTER A
WITH RING ABOVE) and U+212B (ANGSTROM SIGN), or U+004B (LATIN CAPITAL
LETTER K) and U+212A (KELVIN SIGN). Those pairs are canonically
equivalent.
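A quick sketch of those canonical equivalences in Python (again my illustration): the Angstrom and Kelvin signs have singleton canonical decompositions, so every normalization form folds them onto the ordinary letters.

```python
import unicodedata

# U+212B ANGSTROM SIGN canonically decomposes to U+00C5, which NFC keeps
# precomposed and NFD splits into A + COMBINING RING ABOVE.
assert unicodedata.normalize('NFC', '\u212B') == '\u00C5'
assert unicodedata.normalize('NFD', '\u212B') == 'A\u030A'

# U+212A KELVIN SIGN canonically decomposes to plain U+004B.
assert unicodedata.normalize('NFC', '\u212A') == 'K'
```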

> available to the user (ls -b or somesuch.)  Attempting
> canonicalization is doomed to failure, if nothing else when the next
> version of Unicode comes out, and you already have files that are
> encoded with a different set of normalizations.  Now your files cannot
> be accessed!  Oops!

 I might agree that normalization is not necessarily a good thing.
However, the reason you cite is not so solid. The Unicode normalization
forms are **permanently frozen** for existing characters. And UTC and
JTC1/SC2/WG2 have committed themselves not to encode any more precomposed
characters that can be represented with existing base and combining
characters. If you're not sure of their commitment, perhaps using NFD is
safer than using NFC. Hmm.. that may be one of the reasons why Apple
chose NFD in Mac OS X.
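To make the NFC/NFD difference concrete (my illustration): a precomposed accented letter and its base-plus-combining spelling are byte-for-byte different but canonically equivalent, and each normalization form picks one spelling deterministically.

```python
import unicodedata

# 'e with acute' spelled two canonically equivalent ways:
precomposed = '\u00E9'    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'    # 'e' + U+0301 COMBINING ACUTE ACCENT

assert precomposed != decomposed  # different code point sequences

# NFD (Mac OS X's choice for filenames) decomposes; NFC recomposes.
assert unicodedata.normalize('NFD', precomposed) == decomposed
assert unicodedata.normalize('NFC', decomposed) == precomposed
```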

  BTW, without changing anything in the Unix APIs and Unix filesystems
(which would not be desirable anyway), shells 'might' be a good place to
'add' some normalization (as a user-configurable option at invocation
time and via environment variables).
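A hypothetical sketch of that idea (the wrapper and the NORMALIZE_FORM variable are my inventions, not an existing tool): a shell could normalize filename arguments, with the form chosen via an environment variable, before passing them on.

```python
import os
import unicodedata

def normalize_args(args, form=None):
    """Normalize filename arguments before they reach the filesystem.

    The form defaults to a hypothetical NORMALIZE_FORM environment
    variable, falling back to NFC.
    """
    form = form or os.environ.get('NORMALIZE_FORM', 'NFC')
    return [unicodedata.normalize(form, a) for a in args]

# A shell might apply this to the argument vector before exec, e.g.:
#   os.execvp(cmd, [cmd] + normalize_args(args))
```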

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
