On Fri, 22 Feb 2002, Markus Kuhn wrote:

> Gaspar Sinai wrote on 2002-02-22 00:31 UTC:
> > When looking at these as strings (according to Unicode) these
> > strings are the same. In case these strings are used as
> > filenames they will be considered different. Can we resolve
> > this issue without normalization?
>
> Perhaps an extension of the globbing/regexp API is where you want to
> start.

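To make the quoted problem concrete, here is a minimal sketch in Python. It assumes a filesystem that compares names byte for byte, and the helper names (as_filename_bytes, canonically_equal) are made up for this illustration only. It shows two spellings of "ö" that are canonically equivalent as Unicode text but differ as UTF-8 byte sequences:

```python
import unicodedata

# Two spellings of the same letter:
# precomposed U+00F6, and "o" followed by combining diaeresis U+0308.
precomposed = "\u00f6"
decomposed = "o\u0308"


def as_filename_bytes(name):
    """Return the UTF-8 bytes a byte-oriented filesystem would store.

    Assumption: the filesystem does no normalization of its own.
    """
    return name.encode("utf-8")


def canonically_equal(a, b):
    """Compare two strings after normalizing both to form C (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(as_filename_bytes(precomposed))               # b'\xc3\xb6'
print(as_filename_bytes(decomposed))                # b'o\xcc\x88'
print(precomposed == decomposed)                    # False
print(canonically_equal(precomposed, decomposed))   # True
```

So the two names would be distinct files on such a filesystem, while any tool that normalizes before comparing would treat them as one string.
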
I was thinking about this: maybe the NFS server could enforce
normalization form C, so that only the precomposed variant,

  U+00F6 ö

could create a file. A huge number of scripts could be supported
without duplicate filenames. Hangul would immediately be OK without
the need for jamo decomposition. And we are also very lucky that CJK
cannot be decomposed into radicals :)

I admit this would create some problems... The unsupported scripts
would fall into two categories:

1. Scripts that could be supported if a precomposed character could
   go into the standard. For instance:

   Guarani U+0067 U+0303 g̃ (g with a tilde above)
   Guarani U+0047 U+0303 G̃ (G with a tilde above)

2. Scripts that need further thinking:

   - Indic: contrary to common belief, some scripts, like Tamil,
     have a finite number of combinations.
   - Arabic: if we don't use presentation forms we should be fine.

All of the above, of course, is just food for thought.

My fear is that if we let the NFS server take any legal UTF-8
sequence, in some cases we could end up with two or more files that
look the same. If certain utilities are not trained to treat
filenames differently from normal text, we might not be able to
perform certain operations, like a find operation. Maybe someone on
this list has some idea how to train these utilities.

The only reason I don't like making the client side responsible is
that there are a lot of utilities out there. Some of them sort and
expunge duplicates, and they will all use Unicode to do that. Some
commands may not even know that the strings are filenames, and
strings that are equivalent as text may not be equivalent as
filenames: two filenames could appear as one after the uniq(1)
command, for instance.

Thanks
gaspar

-- 
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
