Glenn Maynard wrote on 2002-02-21 08:10 UTC:
> One thing that's bound to be lost in the transition to UTF-8 filenames:
> the ability to reference any file on the filesystem with a pure CLI.

I can generate plenty of file names in ISO 8859-1 that you will have
trouble typing. Try a file name that starts with CR or NBSP, just to
warm up. Nothing new with UTF-8 here. Keep it simple.
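A quick sketch of the point, using Python byte-string file names in a
throwaway temp directory (the names themselves are my own examples): POSIX
file names are opaque byte strings, so names starting with CR or NBSP are
perfectly legal today, with no UTF-8 involved.

```python
import os
import tempfile

d = tempfile.mkdtemp().encode()
# POSIX file names are opaque byte strings; both of these are legal
# but hard to type at a shell prompt:
tricky = [b"\rstarts-with-CR",          # carriage return, ASCII 0x0D
          b"\xc2\xa0starts-with-NBSP"]  # NBSP (ISO 8859-1 0xA0) as UTF-8
for name in tricky:
    open(os.path.join(d, name), "wb").close()

print(sorted(os.listdir(d)))
```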

> If I see a file with a pi symbol in it, I simply can't type that; I have
> to copy and paste it or wildcard it.

How does pi differ from, say, æ or ß in that respect? Nothing new here.

> If I have a filename with all Kanji, I can only use wildcards.

Just like with a file whose name consists entirely of Kanji, I guess.
Has that been a problem in practice so far?

> A normalization form would help a lot, though. It'd guarantee that in
> all cases where I *do* know how to enter a character in a filename,
> I can always manipulate the file.  (If I see "cár", I'd be able to "cat
> cár" and see it, reliably.)

We agreed already ages ago here that Normalization Form C should be
considered to be recommended practice under Linux and on the Web. But
nothing should prevent you in the future from using arbitrary opaque
byte strings as POSIX file names. In particular, POSIX forbids the file
system from applying any sort of normalization automatically. All the URL
security issues that IIS on NTFS had demonstrate what a wise decision
that was.
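To make the division of labour concrete: NFC is something the software
that *creates* names may apply as a convention, never something the file
system does behind your back. A minimal sketch with Python's standard
`unicodedata` module (the example string is my own):

```python
import unicodedata

# "café" typed with a combining accent, as some input methods emit it:
decomposed = "cafe\u0301"                  # 'e' + U+0301 COMBINING ACUTE
composed = unicodedata.normalize("NFC", decomposed)

assert composed == "caf\u00e9"             # precomposed U+00E9
assert len(decomposed) == 5 and len(composed) == 4
```

An application could normalize to NFC before calling open(); the kernel
and file system simply store whatever bytes they are handed.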

> I don't know who would actually normalize filenames, though--a shell
> can't just normalize all args (not all args are filenames) and doing it
> in all tools would be unreliable.

Please do not even think about automatically normalizing file names
anywhere. There is absolutely no need for introducing such nonsense, and
deviating from the POSIX requirement that filenames be opaque byte
strings is a Bad Idea[TM] (also known as NTFS).

> A mandatory normalization form would also eliminate visibly duplicate
> filenames.

No, it won't. Unicode normalization does not eliminate homoglyphs and
cannot possibly do so. That is the wrong tool for the problem. Again,
nothing new here: we have lived happily for over a decade with the
homoglyphs SP and NBSP from ISO 8859-1 in POSIX file systems. Security
problems have arisen in file systems that attempted case-insensitive
matching and other forms of normalization, and by now we know that this
was a bad idea (see the web attack log I posted here 2002-02-14 as one
example).
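Easy to check directly (the homoglyph pairs below are my own examples):
NFC folds combining sequences, but characters that merely *look* alike
stay distinct, so visibly duplicate names survive normalization.

```python
import unicodedata

pairs = [
    ("a b", "a\u00a0b"),                # SPACE vs. NO-BREAK SPACE
    ("paypal", "p\u0430yp\u0430l"),     # Latin 'a' vs. Cyrillic U+0430
]
for x, y in pairs:
    # Both names are already in NFC; normalization changes nothing:
    assert unicodedata.normalize("NFC", x) != unicodedata.normalize("NFC", y)
```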

> Of course, it can't be enforced, but tools that escape
> filenames for output could change unnormalized text to \u/\U.

For the shell and ls, some amount of escaping that is compatible with
shell notation is certainly a good idea. It will also include \x for
malformed UTF-8 sequences and ASCII control characters.
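A rough sketch of the kind of display escaping meant here (the function
name and exact policy are my own, not an existing tool's): valid printable
characters pass through, while ASCII control characters and bytes that are
not valid UTF-8 come out as shell-style \xNN escapes.

```python
def escape_filename(raw: bytes) -> str:
    """Render a POSIX file name (opaque bytes) safely for terminal output."""
    # surrogateescape maps each malformed byte NN to U+DCNN, so malformed
    # UTF-8 can be detected and escaped individually:
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:       # marker for a malformed byte
            out.append("\\x%02x" % (cp & 0xFF))
        elif cp < 0x20 or cp == 0x7F:    # ASCII control character
            out.append("\\x%02x" % cp)
        else:
            out.append(ch)
    return "".join(out)

print(escape_filename(b"a\nb"))      # embedded newline becomes \x0a
print(escape_filename(b"caf\xe9"))   # lone 0xE9 is not valid UTF-8
```

The \x notation is compatible with the shell's $'...' quoting, so an
escaped name can be pasted straight back into a command line.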

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
