Dmitrij had some questions about my intent; I'll try to clarify.

2014/12/02 18:57 "Joel Rees" <joel.r...@gmail.com>:
>
> (apologies for the html.)
>
> 2014/12/02 9:52 "Dmitrij D. Czarkoff" <czark...@gmail.com>:
[ ... and others
Snipped context:
  There was some discussion of what kind of file names should be allowed to
be stored.
  There was something I read as a suggestion for using a normal form based
on Unicode as a target for enforced file name conversion.
  There were some attempts to discuss reasons why file names should not be
forcibly converted.

  And then communication seemed to really break down when I tried to
present a semi-obvious example of why seemingly innocuous conversions turn
out to be not so innocuous after all.]

And, since that didn't work, I tried an example closer to the suggested
normal form:

> > Joel Rees said:
> > > Now, what would you do with this?
> > >
> > > ジョエル
> > >
> > > Why not decompose it to the following?
> > >
> > > ジョエル

Which didn't communicate the problem, either.

> > Because it is not what Unicode normalization is.
>
> Well, it definitely isn't Unicode normalization. And there is a reason it
> isn't, even though there were many who thought the Unicode standard
> shouldn't include code points for wide-form glyphs.
>
> Let's try one more. I think you have said enough that I can infer that
> your preferred normal form is the decomposed form. So, given that your
> normalization has resulted in a file named
>
> シ゛ョエルの歌
>
> and

given

> the necessity to send it back where it came from, how do you know
> whether or not it should be restored to
>
> ジョエルの歌
>
> before you send it back?
>
> > [...]

But normalization is a red herring in this context.

You may personally have no problem with improperly done filename
conversions, but I am not willing to take them lightly where my data is
concerned. I may have a NAS device that I'm using for backup without
compression or archiving (i.e., no tar/zip), and if I have a file with a
decomposed name backed up on the NAS, I don't want it automatically
converted to the composed form when it is restored, the existence of normal
forms notwithstanding.

Unix file systems can handle UTF-8 encoded file names without losing data
because no conversion is necessary. There may be issues with displaying
them, but the file name itself is safe, because '/' is always '/' and '\0'
is always '\0'.

You can even handle broken UTF-8, or unconverted UTF-16/32 of whatever byte
order, spit into the file name as a raw sequence of bytes, if and only if
you escape NUL, slash, and your escape character properly, restoring the
escaped characters when putting the file names back on the network.
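By "escape properly" I mean something along these lines. This is only a toy
Python sketch; '%' is an arbitrary choice of escape byte, not any particular
protocol's scheme:

# Escape NUL, '/', and the escape byte itself; everything else passes
# through untouched, so any byte sequence survives the round trip.
ESC = ord("%")
SPECIAL = {0x00, ord("/"), ESC}

def escape_name(raw: bytes) -> bytes:
    out = bytearray()
    for b in raw:
        if b in SPECIAL:
            out += b"%%%02X" % b          # e.g. '/' becomes %2F, NUL becomes %00
        else:
            out.append(b)
    return bytes(out)

def unescape_name(escaped: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == ESC:
            out.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return bytes(out)

# An unconverted UTF-16LE name, BOM and NULs included, comes back intact.
name = b"\xff\xfeJ\x00o\x00e\x00l\x00"
assert unescape_name(escape_name(name)) == name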

Normalization alone does not know how to restore a potentially normalized
name. It needs some sort of flag character that says "this name was
normalized", and a way to choose between denormalized forms when more than
one denormalized form maps to one particular normal form.
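The many-to-one part is easy to demonstrate (again, just a quick Python
illustration of the point, not anything from the standard):

import unicodedata

composed = "ジョエルの歌"                               # precomposed ジ
decomposed = unicodedata.normalize("NFD", composed)     # シ plus combining dakuten

print(composed == decomposed)                           # False: different code points
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))      # True: both collapse to one NFC form
# Given only the normalized result, nothing says which of the two it came from.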

The last time I looked, the Unicode standard itself stated that this was
the case, and that normalized forms were not recommended for such purposes.
The craziness currently infecting the entire industry leaves me with no
confidence that such is still the case.

I haven't used Apple OSes since around 10.4, but Mac OS X was doing a thing
where certain well-known directory names were aliased according to the
current locale. For instance, the user's "music" directory was shown as
「音楽」 when the locale was set to ja_JP.UTF-8. This is useful to desktop
users, but is sometimes confusing when you log in via ssh from a terminal
that does not display Japanese and fails to declare itself as such. It's
convenient, but even this can cause problems when backing up the entire
home or user directory, if the backup software doesn't know to ask for the
OS canonical name.

Again, apologies for using my (erk) Android device and spitting html at the
list.

Joel Rees

Computer memory is just fancy paper,
CPUs just fancy pens.
All is a stream of text
flowing from the past into the future.
