2014/12/03 22:23 "Dmitrij D. Czarkoff" <czark...@gmail.com>:
>
> First of all, I really don't believe that preservation of non-canonical
> form should be a consideration for any software.

There is no particular canonical form for some kinds of software.

Unix, in particular, happens to have file name restrictions that are
compatible with the UTF-8 form of every version of Unicode since at least
2.0, but it has no native encoding. Most of the tools support ASCII, and
many now support Unicode, but the OS itself imposes no encoding. That's
one of the strengths of Unix.
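
For what it's worth, here is a minimal sketch of that point, assuming a
POSIX system and Python 3 (the name "héllo" and the script are purely
illustrative):

    import os

    # The kernel treats a file name as an opaque byte string; only b"/"
    # (the path separator) and b"\x00" are off limits.  These bytes
    # happen to be the UTF-8 encoding of "héllo", but nothing checks
    # or cares.
    name = "héllo".encode("utf-8")
    with open(name, "wb") as f:
        f.write(b"demo\n")

    # Listing the directory as bytes hands the name back untouched,
    # whatever encoding (or non-encoding) it was written with.
    print(name in os.listdir(b"."))

The same bytes could just as well be Latin-1 or EUC-JP; the OS neither
knows nor cares.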

> There is no single
> reason to allow non-canonical forms to exist at all,

non-canonical forms in what context?

> while there are
> several reasons to avoid them.

Which non-canonical forms?

> More so for foreign encodings in
> filenames -

Define foreign encoding, too. Make sure your definition works for my
context.

Now, if you don't mind keeping my data away from your machine, maybe it's
okay if your definition doesn't work for my context. For some 7 billion
definitions of "me".

> if you are trying to store UTF-16 names on a system with
> UTF-8 locale, you should be converting, not escaping.

Not much argument with that. Many things that can be done should not
necessarily be done.

Most of the time, anyway. There may be some special cases, but you are
talking about file names, and I can't think of any, right off the bat.

> Doing otherwise
> is just asking for troubles.

Oh, I just thought of a couple of exceptions. Theoretical at this point,
but definitely exceptions.

There's no rule that an OS has to use byte-string file names. (And you
don't have to do the stupid things a certain well-known OS does, which uses
UTF-16 as its native transform and Unicode as its native encoding.) But you
know that.

> Next, I assume that ability to enter filenames trumps ability to
> preserve original filename on Unix-like systems.

Entering file names is a function of the tools, not of the OS. And if you
want tools that are limited to NFD, you are free to build and use them.

> In most cases right
> now these two values don't clash, because user input is normalized from
> the very beginning in IME.

Choice, function, and construction of the input stack (and output stack) are
nearly completely independent of the OS (for any decent OS).

> That said, there may be exceptions.  Eg.
> several mail clients won't normalize filename if input encoding matches
> encoding of attachment.

Mail clients are also pretty independent of the OS.

> Thus, having received a file with non-ASCII
> filename from Mac, you'll end up being unable to address it from shell
> even if it was typed using exactly the same keyboard layout you use.

Keyboard layout is independent of the OS. And it is actually possible to
set up an OpenBSD keyboard and input method that closely mimics a Macintosh.
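
To make the mismatch concrete, here is a hedged sketch in Python 3; the
string "café", the NFC-producing input method, and the NFD-storing Mac
file system are just the assumed setup:

    import unicodedata

    typed   = unicodedata.normalize("NFC", "café")   # what most IMEs emit
    on_disk = unicodedata.normalize("NFD", "café")   # what HFS+ stores

    print(typed == on_disk)          # False
    print(typed.encode("utf-8"))     # b'caf\xc3\xa9'
    print(on_disk.encode("utf-8"))   # b'cafe\xcc\x81'

Both render identically on screen, but the byte strings differ, so the
shell's exact-match lookup finds nothing.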

> I
> don't see how this situation may be justified.

Doesn't need to be. Only needs to be worked around.

> The rare cases when
> original filenames must be preserved byte to byte warrant some special
> handling (eg. storing filenames elsewhere separately or preserving the
> whole files with names and attributes in some archive or other form of
> special database).

Actually, the contexts in which data handling should be orthogonal to
filename encodings are the more common ones. The OS has to do a lot
that the user never sees, and those internal functions just start fighting
each other once they start making assumptions about things like encodings.

> Finally, provided that both ends of network communication use canonical
> forms for Unicode, the matter of storing file remotely and then
> receiving it back with filename intact is simply a matter of
> normalization on receiver's side.

As long as you don't drop bytes somehow on the way from here to there.

> That is: if you prefer your local
> files in NFD, and your NAS uses NFC, you should simply normalize
> filenames when you receive files back.

Not OS issues. Application issues. Maybe tool issues, for a limited subset
of tools.
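
If a tool did want to take that on, it would not need much. A sketch of
such a receiving-side pass, assuming Python 3 and names that decode
cleanly in the local locale (the function name and the NFD preference
are my own inventions, not anything specified in this thread):

    import os
    import unicodedata

    def renormalize(directory, form="NFD"):
        # Rename everything in `directory` into the locally preferred
        # normalization form after pulling files back from the NAS.
        for name in os.listdir(directory):
            fixed = unicodedata.normalize(form, name)
            if fixed != name:
                os.rename(os.path.join(directory, name),
                          os.path.join(directory, fixed))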

> The only potential problem here
> is "compatibility" normalizations, but these are already problematic
> enough to be avoided in all cases where NFD or NFC do the job.

Broken compatibility normalizations get invented precisely because OS
architects think an OS needs a native encoding.
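
And for the record, here is why the compatibility forms are best avoided
(Python 3; the strings are arbitrary examples). The folding is one-way,
so names that were distinct collapse and the original spelling cannot be
recovered:

    import unicodedata

    for s in ["ﬁle", "x²", "Ⅻ"]:
        print(repr(s), "->", repr(unicodedata.normalize("NFKC", s)))

    # 'ﬁle' -> 'file'   (ligature folded away)
    # 'x²'  -> 'x2'     (superscript flattened)
    # 'Ⅻ'   -> 'XII'    (Roman numeral expanded)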

Remember, the UTF transformation formats were invented independently of Unicode.
They were adopted by the Unicode Consortium about the time the Consortium
finally became convinced that there really are more than 65,536
character-like objects that need a code point in a modern information
encoding scheme.

UTF-8 and Unicode are not equivalent.

Joel Rees

Computer memory is just fancy paper,
CPUs just fancy pens.
All is a stream of text
flowing from the past into the future.
