On Sun, Nov 30, 2014 at 6:31 PM, Dmitrij D. Czarkoff <czark...@gmail.com> wrote:
> Joel Rees said:
>>> That said, the standard provides just enough facilities to make
>>> filesystem-related aspects of Unicode work nicely, particularly in case
>>> of utf-8.  Eg. ability to enforce NFD for all operations on file names
>>> could actually make several things more secure by preventing homograph
>>> attacks.
>>
>> I think this assertion is a bit optimistic, and not just given your
>> following caveat.
>
> Given that I have to cope with Unicode file names every day,

Same here, FWIW, Japanese. (And then there are the times I have to
work on file names encoded in shift-JIS. Fun stuff.)

> I just
> can't see a more pessimistic approach than just allowing arbitrary Unicode
> codepoints with no sanitization whatsoever.

Pessimistic? Optimistic? Asking for trouble, yes.

I generally try to use "Romaji" (latinized phonetic Japanese, all
ASCII, if I avoid the overbar approach to lengthened vowels) when I
know a file is going to move to another machine. If file names are
strictly phonetic, you can set up a round-trip mapping from Romaji to
kana, but most of the time Japanese file names include Kanji, and
there is no round-trip mapping that can be meaningfully read by a
human.
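
As a rough illustration of the kana-only case, here is a minimal sketch in
Python. The table is a hypothetical, deliberately tiny subset, and the
function names are made up for this example. The one design assumption is
that the syllabic "n" is always written as n' so that it can never merge
with a following vowel; without a rule like that the mapping would not
round-trip.

```python
# Hypothetical, deliberately tiny table: hiragana -> ASCII romaji.
# A real table would cover the full syllabary, small kana, long vowels, etc.
KANA_TO_ROMAJI = {
    "あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
    "か": "ka", "き": "ki", "く": "ku", "け": "ke", "こ": "ko",
    "な": "na", "に": "ni", "ぬ": "nu", "ね": "ne", "の": "no",
    # Assumption: syllabic n always carries an apostrophe, so "kan'i" (かんい)
    # and "kani" (かに) stay distinct.
    "ん": "n'",
}
ROMAJI_TO_KANA = {romaji: kana for kana, romaji in KANA_TO_ROMAJI.items()}
# Try longer romaji keys first so "ka" wins over "a".
_KEYS = sorted(ROMAJI_TO_KANA, key=len, reverse=True)


def kana_to_romaji(text):
    """Map kana to romaji; raises KeyError for anything outside the table (e.g. kanji)."""
    return "".join(KANA_TO_ROMAJI[ch] for ch in text)


def romaji_to_kana(text):
    """Inverse of kana_to_romaji for strings produced by it."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(ROMAJI_TO_KANA[key])
                i += len(key)
                break
        else:
            raise ValueError("no kana for romaji at offset %d" % i)
    return "".join(out)
```

Feeding a kanji to kana_to_romaji() fails with KeyError, which is the
point made above: once kanji show up, there is no table like this to fall
back on.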

There are ASCII-encoded JIS codes which could be used to produce a
round-trip mapping, but I'd need to run the output of ls through some
sort of a custom filter to make sense of the names. Might be a useful
thing to build.
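
A minimal sketch of what such a filter might do, in Python. Everything
here is an assumption made for illustration: the "~XXXX" notation, the
function names, and the restriction to characters in JIS X 0208. It leans
on the fact that a two-byte EUC-JP character is the JIS X 0208 code with
the high bit set on each byte.

```python
import re


def jis_escape(name):
    """Replace each non-ASCII character with ~XXXX, its JIS X 0208 code in hex.

    "~" is a marker chosen for this sketch; a literal "~" is doubled.
    """
    out = []
    for ch in name:
        if ch == "~":
            out.append("~~")
        elif ord(ch) < 0x80:
            out.append(ch)
        else:
            raw = ch.encode("euc_jp")  # UnicodeEncodeError if not representable
            if len(raw) != 2 or raw[0] < 0xA1:
                # Half-width kana and JIS X 0212 use other EUC-JP forms;
                # this sketch does not handle them.
                raise ValueError("not a JIS X 0208 character: %r" % ch)
            out.append("~%02X%02X" % (raw[0] & 0x7F, raw[1] & 0x7F))
    return "".join(out)


def jis_unescape(text):
    """Inverse of jis_escape for strings produced by it."""
    def repl(match):
        if match.group(0) == "~~":
            return "~"
        code = match.group(1)
        hi, lo = int(code[:2], 16), int(code[2:], 16)
        return bytes([hi | 0x80, lo | 0x80]).decode("euc_jp")

    return re.sub(r"~~|~([0-9A-F]{4})", repl, text)
```

The escaped form is pure ASCII and round-trips, but it is no more
readable than a hex dump, so a filter in the other direction
(jis_unescape() applied to each line of ls output) would still be needed
to make sense of the names.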

> Every now and then I have
> to use printf(1) and xclip(1x) just because there is no other way to
> address a file or identify all codepoints of its name.  From here I
> don't see ability to enforce policy on Unicode strings as something as
> useless as you put it.

Not saying it's useless to have a policy.

What I'm saying is that Unicode in UTF-8 has parsing problems independent
of issues like characters that look the same but have separate code
points. UTF-8 itself is pretty simple until you start mapping it to real
characters. Getting that mapping right is difficult, which is why you
have your policy, I think.
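
To make the two kinds of problem concrete, here is a small sketch using
Python's standard unicodedata module. The helper names and sample strings
are just illustrations. It shows one case that enforcing NFD does settle
(precomposed versus decomposed kana) and one that it does not (a Latin
letter versus a look-alike Cyrillic letter).

```python
import unicodedata


def nfd(name):
    """Return the NFD (canonical decomposition) form of a name."""
    return unicodedata.normalize("NFD", name)


def codepoints(name):
    """List the code points of a name, e.g. for inspecting a file name."""
    return " ".join("U+%04X" % ord(ch) for ch in name)


# Same visible character, two different code point sequences.
composed = "\u304c"          # HIRAGANA LETTER GA as one code point
decomposed = "\u304b\u3099"  # HIRAGANA LETTER KA + combining voiced sound mark

# Different characters that merely look alike in most fonts.
latin_a = "a"                # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"        # U+0430 CYRILLIC SMALL LETTER A

# Without normalization the two spellings of "ga" are different names.
same_before = composed == decomposed
# After NFD they compare equal.
same_after = nfd(composed) == nfd(decomposed)
# NFD leaves the Latin/Cyrillic pair distinct, so the homograph remains.
homograph_fixed = nfd(latin_a) == nfd(cyrillic_a)
```

Here same_before is False, same_after is True, and homograph_fixed is
False: normalization handles equivalent encodings of one character, but
look-alike characters need a separate policy.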

One of these days I want to build a ctype library that gives
meaningful results for the Japanese subset of the CJK subset of
Unicode. But that's only going to help with some of the problems.
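
For a sense of where such a library might start, here is a minimal sketch
in Python that classifies single characters by Unicode block. The
function names are hypothetical, and the block ranges are only a first
approximation: the CJK Unified Ideographs block is shared with Chinese
and Korean usage, and characters such as the iteration mark U+3005 or
half-width katakana fall outside the ranges used here.

```python
def is_hiragana(ch):
    """True if ch lies in the Hiragana block (U+3040..U+309F)."""
    return 0x3040 <= ord(ch) <= 0x309F


def is_katakana(ch):
    """True if ch lies in the Katakana block (U+30A0..U+30FF)."""
    return 0x30A0 <= ord(ch) <= 0x30FF


def is_kanji(ch):
    """True if ch lies in the CJK Unified Ideographs block (U+4E00..U+9FFF).

    Assumption: block membership stands in for "kanji"; the block itself
    is not specific to Japanese.
    """
    return 0x4E00 <= ord(ch) <= 0x9FFF


def classify(ch):
    """Return a coarse class name for one character."""
    if is_hiragana(ch):
        return "hiragana"
    if is_katakana(ch):
        return "katakana"
    if is_kanji(ch):
        return "kanji"
    return "other"
```

Range checks like these answer "which script is this?", which is only
one of the questions a ctype-style library would have to handle.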

-- 
Joel Rees

Be careful when you look at conspiracy.
Look first in your own heart,
and ask yourself if you are not your own worst enemy.
Arm yourself with knowledge of yourself, as well.
