Followup to: <[EMAIL PROTECTED]>
By author: Jungshik Shin <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
>
> > There *is* no way to solve this problem. You have the same kind of
> > problem with U+0041 LATIN CAPITAL LETTER A versus U+0391 GREEK CAPITAL
> > LETTER ALPHA. However, if you attempt normalizations you *will*
>
> U+0041, U+0391, and U+0410 are NOT equivalent in any Unicode normalization
> form. They're not even equivalent in NFK*. Note that I didn't
> just say visually (almost) identical but also modified it
> with 'canonically equivalent'.
The whole point was that users could not care less what "normalization
form" is used -- to them, the same pattern of dots on the screen is the
same character. My point is that no sensible normalization form is
ever going to solve that problem for you. Also observe that
normalization, if done, has to be done in an identical manner
everywhere, which is impossible in the long run.
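For anyone who wants to verify that, here is a minimal sketch using
Python's unicodedata module (assuming a Python whose Unicode tables are
reasonably current):

    import unicodedata

    # LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA,
    # CYRILLIC CAPITAL LETTER A -- three visually identical glyphs
    chars = ["\u0041", "\u0391", "\u0410"]
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = [unicodedata.normalize(form, c) for c in chars]
        # All three remain distinct under every normalization form.
        print(form, normalized[0] == normalized[1],
                    normalized[0] == normalized[2])
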
> > introduce security holes in the system (as have been amply shown by
> > Windows, even though *its* normalizations are even much simpler.)
>
> Therefore, your example cannot be used to show that there's a security
> hole (unless you're talking about applying normalization not specified
> in Unicode) although it can be used to demonstrate that even after
> normalization, there still could be user confusion because there are some
> visually (almost) identical characters that would be treated differently.
>
> A better example for your case would be U+00C5 (LATIN CAPITAL LETTER A
> WITH RING ABOVE) and U+212B (ANGSTROM SIGN), or U+004B and U+212A
> (KELVIN SIGN). They're canonically equivalent.
>
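(True enough; a quick check with Python's unicodedata bears this out:

    import unicodedata

    # NFC maps the canonical singletons back to the ordinary letters.
    print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")  # True
    print(unicodedata.normalize("NFC", "\u212a") == "\u004b")  # True

But that only helps for the handful of characters the standard declares
equivalent; it does nothing for the A/Alpha case above.)
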
> > available to the user (ls -b or somesuch.) Attempting
> > canonicalization is doomed to failure, if nothing else when the next
> > version of Unicode comes out, and you already have files that are
> > encoded with a different set of normalizations. Now your files cannot
> > be accessed! Oops!
>
> I might agree that normalization is not necessarily a good thing.
> However, your cited reason is not so solid. The Unicode normalization
> forms are **permanently frozen** for existing characters, and UTC and
> JTC1/SC2/WG2 have committed themselves not to encode any more
> precomposed characters that can be represented with existing base and
> combining characters. If you're not sure of their commitment, perhaps
> using NFD is safer than using NFC. Hmm... that may be one of the
> reasons why Apple chose NFD in Mac OS X.
I believe that commitment just as much as I believed the Unicode 1.x
"Unicode will never be extended beyond 16 bits" commitment.
Also, "for existing characters" isn't good enough. Since it is
perfectly possible to enter characters that don't exist yet or more
specifically didn't exist when the OS was created (consider an NFS
mount shared between two macines with different OS releases) the
problem should be obvious.
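To make the failure mode concrete, here is a sketch (assuming, as on
Unix, that filename lookup is a plain byte comparison): a name created
under one normalization form simply cannot be found by a lookup done
under the other, because the UTF-8 byte sequences differ:

    import unicodedata

    nfd = "re\u0301sume\u0301"               # "résumé" typed on an NFD system
    nfc = unicodedata.normalize("NFC", nfd)  # the same "résumé", precomposed
    print(nfd == nfc)                        # False
    print(nfd.encode("utf-8"))               # b're\xcc\x81sume\xcc\x81'
    print(nfc.encode("utf-8"))               # b'r\xc3\xa9sum\xc3\xa9'
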
> BTW, without changing anything in the Unix APIs or the Unix filesystem
> (changing which would not be desirable anyway), shells 'might' be a
> good place to 'add' some normalization (as a user-configurable option,
> set at invocation time or via environment variables).
Not the shell. The input system, and possibly editors (including readline
and its equivalents.) The shell, or anything else that handles
\-escapes, should be augmented to handle \uXXXX and \UXXXXXXXX
escapes, and programs like ls should have a way to display such
escapes when an "odd" (noncanonical or otherwise unexpected) encoding
is encountered -- including such things as broken UTF-8 sequences,
noncharacters, and classical ASCII control characters.
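As a sketch of what that display mode might look like (display_name()
here is a hypothetical helper, not anything ls actually provides), it
could render control characters, unassigned code points, noncharacters,
and malformed UTF-8 bytes as escapes:

    import unicodedata

    def display_name(raw):
        # Hypothetical helper: render a raw filename for display,
        # escaping anything "odd" rather than printing it literally.
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Broken UTF-8: fall back to byte-wise \xNN escapes.
            return "".join(chr(b) if 0x20 <= b < 0x7f else "\\x%02x" % b
                           for b in raw)
        out = []
        for ch in text:
            cp = ord(ch)
            noncharacter = (cp & 0xfffe) == 0xfffe or 0xfdd0 <= cp <= 0xfdef
            if cp < 0x20 or cp == 0x7f:
                out.append("\\u%04x" % cp)      # classical ASCII controls
            elif noncharacter or unicodedata.category(ch) == "Cn":
                out.append("\\U%08x" % cp if cp > 0xffff
                           else "\\u%04x" % cp)
            else:
                out.append(ch)
        return "".join(out)

    print(display_name(b"ok\x07name"))   # -> ok\u0007name
    print(display_name(b"bad\xff.txt"))  # -> bad\xff.txt (byte escaped)
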
-hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[EMAIL PROTECTED]>