Dmitrij had some questions about my intent, so I'll try to clarify.

2014/12/02 18:57 "Joel Rees" <joel.r...@gmail.com>:
>
> (Apologies for the HTML.)
>
> 2014/12/02 9:52 "Dmitrij D. Czarkoff" <czark...@gmail.com>:

[... and others snipped. Context: there was some discussion of what kinds
of file names should be allowed to be stored. There was something I read as
a suggestion for using a normal form based on Unicode as a target for
enforced file name conversion, and there were some attempts to discuss
reasons why file names should not be forcibly converted.
And then communication seemed to really break down when I tried to present
a semi-obvious example of why seemingly innocuous conversions turn out to
be not so innocuous after all.]

And, since that didn't work, I tried with an example closer to the
suggested normal form:

> > Joel Rees said:
> > > Now, what would you do with this?
> > >
> > > ジョエル
> > >
> > > Why not decompose it to the following?
> > >
> > > ｼﾞｮｴﾙ

Which didn't communicate the problem, either.

> > Because it is not what Unicode normalization is.
>
> Well, it definitely isn't Unicode normalization. And there is a reason it
> isn't, even though there were many who thought the Unicode standard
> shouldn't include code points for wide-form glyphs.
>
> Let's try one more. I think you have said enough that I can infer that
> your preferred normal form is the decomposed form. So, given that your
> normalization has resulted in a file named
>
> ジョエルの歌 (that is, シ followed by a combining voicing mark)
>
> and given the necessity to send it back where it came from, how do you
> know whether or not it should be restored to
>
> ジョエルの歌
>
> before you send it back?

[...]

But normalization is a red herring in this context. You may personally have
no problems with filename conversions improperly done, but I am not willing
to take them lightly where my data is concerned. I may have a NAS device
that I'm using for backup without compression or amalgamation (i.e.,
tar/zip), and if I have a file with a decomposed name backed up on the NAS,
I don't want it automatically converted to the composed form when it is
restored, the existence of normal forms notwithstanding.

Unix file names can hold UTF-8 encoded Unicode without losing data because
no conversion is necessary. There may be issues with displaying them, but
the file name itself is safe, because '/' is always '/' and '\0' is always
'\0'.
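To make the exchange above concrete, here is a minimal Python sketch (my own illustration, not anything from the original thread) showing why a normalized name cannot be restored: the composed and decomposed spellings of ジョエル normalize to the same string, so after normalization you can no longer tell which bytes the file name originally used.

```python
import unicodedata

# Composed form: ジ is a single code point (U+30B8).
composed = "\u30b8\u30e7\u30a8\u30eb"          # ジョエル
# Decomposed form: シ (U+30B7) plus a combining voicing mark (U+3099).
decomposed = "\u30b7\u3099\u30e7\u30a8\u30eb"  # シ+◌゙ョエル

# Both spellings normalize to the same NFD string...
assert unicodedata.normalize("NFD", composed) == \
       unicodedata.normalize("NFD", decomposed)

# ...so normalization is lossy with respect to the original spelling:
# only one of the two inputs survives a round trip unchanged.
print(unicodedata.normalize("NFD", composed) == composed)      # False
print(unicodedata.normalize("NFD", decomposed) == decomposed)  # True
```

The same collision happens in the other direction with NFC, which is the point: whichever normal form the storage side enforces, two distinct incoming names can land on one stored name.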
You can even handle broken UTF-8, or unconverted UTF-16/32 of whatever byte
order spat into the file name, as a plain sequence of bytes, if and only if
you escape NUL, slash, and your escape character properly, restoring the
escaped characters when putting the file names back on the network.

Normalization alone does not know how to restore a potentially normalized
name. It needs some sort of flag character that says "this name was
normalized", and a way to choose between denormalized forms when more than
one denormalized form maps to one particular normal form. The last time I
looked, the Unicode standard itself stated that this was the case, and that
normalized forms were not recommended for such purposes. The craziness
currently infecting the entire industry leaves me with no confidence that
such is still the case.

I haven't used Apple OSes since around 10.4, but Mac OS X was doing a thing
where certain well-known directory names were aliased according to the
current locale. For instance, the user's "Music" directory was shown as
「音楽」 when the locale was set to ja_JP.UTF-8. This is useful to desktop
users, but is sometimes confusing when you log in via ssh from a terminal
that does not display Japanese and fails to declare itself as such. It's
convenient, but even this can cause problems when backing up the entire
home or user directory, if the backup software doesn't know to ask for the
OS-canonical name.

Again, apologies for using my (erk) Android device and spitting HTML at
the list.

Joel Rees

Computer memory is just fancy paper, CPUs just fancy pens.
All is a stream of text flowing from the past into the future.
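P.S. — a minimal sketch, in Python, of the escape-and-restore scheme described above. The choice of '%' as the escape byte and the two-hex-digit encoding are my assumptions for illustration, not part of any standard.

```python
# Treat the incoming name as raw bytes; escape NUL, slash, and the
# escape byte itself; restore them exactly when sending the name back.
ESCAPE = ord("%")                   # assumed escape byte, for illustration
SPECIAL = {0x00, ord("/"), ESCAPE}  # bytes that must be escaped

def escape_name(raw: bytes) -> bytes:
    out = bytearray()
    for b in raw:
        if b in SPECIAL:
            out += b"%%%02X" % b    # e.g. NUL -> b"%00", '/' -> b"%2F"
        else:
            out.append(b)
    return bytes(out)

def unescape_name(escaped: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == ESCAPE:
            out.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return bytes(out)

# Round trip: any byte sequence survives, including broken UTF-8.
name = b"a/b\x00\xff%broken"
assert unescape_name(escape_name(name)) == name
```

Because the mapping is byte-for-byte invertible, no flag character or guesswork is needed on the way back, which is exactly what normalization cannot offer.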