On Sat, Feb 23, 2002 at 10:18:28AM +0900, Gaspar Sinai wrote:
> This was just a suggestion to clean up things by
> specifying the characters that can be allowed for
> filenames. Currently we can not have "/", ".", ".."
> and "\0" for a filename. What if we say we can not

More precisely, you can't have "." or ".." as a filename, and you can't
have "/" or NUL *in* a filename.  You can look at the first two as
"these files already exist" rather than as a real restriction.
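
To make the "pass or fail" nature of these rules concrete, here is a
minimal sketch of the check, assuming a typical Unix filesystem.  The
function name is made up for illustration, rejecting the empty name is
my own addition, and length limits such as NAME_MAX are ignored.

```python
def is_valid_filename(name: bytes) -> bool:
    """Pass/fail check for one path component (hypothetical helper)."""
    if name in (b"", b".", b".."):
        # "." and ".." already exist in every directory; the empty name
        # is rejected here as an assumption, since it is not a component.
        return False
    # "/" separates path components and NUL terminates C strings, so
    # neither byte can appear inside a name.  Every other byte passes.
    return b"/" not in name and b"\0" not in name
```

Note that the check never looks at the encoding: b"\xff\xfe" passes
just as well as valid UTF-8 does, which is what "nearly 8-bit clean"
means here.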

> have composing and zero with characters for a filename?

Er, composing characters are OK, NFC just avoids them when there's a
precomposed alternative available.  (And Pablo said that there are some
zero-width characters that are useful in filenames ... which is rather
annoying.)
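
A small illustration of that point, assuming Python's standard
unicodedata module (the helper name nfc is just for this sketch): NFC
composes a base letter and combining mark when a precomposed character
exists, and leaves the combining mark alone when it doesn't.

```python
import unicodedata

def nfc(s: str) -> str:
    """Normalize a string to NFC (thin wrapper, for illustration only)."""
    return unicodedata.normalize("NFC", s)

# "e" + COMBINING ACUTE ACCENT has a precomposed form, U+00E9.
composed = nfc("e\u0301")

# "q" + COMBINING ACUTE ACCENT has no precomposed form, so the
# combining character survives normalization unchanged.
kept = nfc("q\u0301")
```

Here composed is the single character U+00E9, while kept is still two
code points.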

Why can't we do that? Because filenames would go from being nearly
8-bit clean to having UTF-8-specific requirements.  That's not the FS's
job.  And this couldn't be limited to NFS: the problems you're describing
would happen with local filesystems, too--and they need to work with all
active charsets, not just UTF-8.

> That would not need compicated normalization - just
> a character check.

The current restrictions on filenames have been around forever, are
unavoidable, and are the only things keeping filenames from being
completely 8-bit clean.  (Normalization involves changing text, as
well; the existing restrictions are simply pass or fail.)

Aside: can a UTF-8 string ever grow longer due to being changed to NFC?
I'd expect a wide char string not to grow in the common cases, but it's
not clear that this holds with UTF-8 (and if so, that it always will).
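
One way to probe this is to compare byte lengths before and after
normalizing.  This is only a sketch, assuming Python's unicodedata
module; the function name is hypothetical.  The interesting inputs are
characters that have a canonical decomposition but are excluded from
recomposition, such as U+0958 (DEVANAGARI LETTER QA).

```python
import unicodedata

def utf8_len_change(s: str):
    """Return (bytes before, bytes after) converting s to NFC."""
    before = len(s.encode("utf-8"))
    after = len(unicodedata.normalize("NFC", s).encode("utf-8"))
    return before, after

# Usually NFC shrinks or preserves the length:
shrinks = utf8_len_change("e\u0301")   # e + combining acute -> U+00E9
ohm = utf8_len_change("\u2126")        # OHM SIGN -> GREEK CAPITAL OMEGA

# U+0958 decomposes to U+0915 U+093C and is never recomposed, so the
# NFC form is longer than the input.
grows = utf8_len_change("\u0958")
```

So the answer appears to be yes: the last case goes from 3 bytes to 6,
and it grows in code points as well, not only in UTF-8.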

> The problem occurs if normalization does happen - and some programs
> may do normalization.

If any are normalizing to NFD, they should probably be changed to not do
that.  Fixing that isn't the FS's job.

But the filesystem, C library calls, network protocols, etc. should
*never* change filenames at all.  That stuff must remain 8-bit clean
(as far as it is now.)

I'm not advocating any low-level constraints or normalization at all.  I
just want to be able to use UTF-8 in filenames without hitting filenames
that I can't enter with copy and paste.  That's not the FS's job to fix,
it's the interface's.  The simple solution, having tools escape zero-width
chars and other oddities, isn't quite good enough, because some of these
characters are useful in filenames.  (I might settle for it myself--I
don't use any languages that need them--but it'd be nice to find a more
general solution.)

This isn't a new problem; it's new symptoms of an old one.  The old
symptoms were fixed by escaping invalid byte sequences, spaces, and ASCII
control characters--the new ones just need to be worked out.  (Invalid
UTF-8 sequences aren't one of these new problems--ls already escapes
those.)
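
As a rough sketch of what the "simple solution" could look like on the
interface side, the function below escapes a raw filename for display
only and never touches the name on disk.  Everything here is an
assumption for illustration: the function name, the escape syntax
(which is not what ls prints), and the choice to treat every Unicode
"Cf" format character as an oddity, which is exactly the
over-escaping that makes this approach not quite good enough.

```python
import unicodedata

def escape_for_display(raw: bytes) -> str:
    """Return a printable, unambiguous form of a raw filename."""
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN, so
    # invalid UTF-8 survives decoding and can be spotted below.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # An invalid input byte: show it as a hex byte escape.
            out.append("\\x%02x" % (cp - 0xDC00))
        elif ch in (" ", "\\"):
            # Old symptoms: spaces, plus the escape character itself.
            out.append("\\" + ch)
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Control characters (Cc) and format characters (Cf); the
            # latter include zero-width space, joiner and non-joiner.
            out.append("\\u{%04x}" % cp)
        else:
            out.append(ch)
    return "".join(out)
```

For example, a name containing U+200B (ZERO WIDTH SPACE) comes out with
a visible \u{200b} in it, so it can at least be told apart from the
name without it.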

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
