On Wed, 15 Feb 2017 10:18:32 +0100
<[email protected]> wrote:
> On Tue, Feb 14, 2017 at 10:19:14PM +0000, Chris Vine wrote:
[snip]
> > Filenames and locales are not necessarily related.  When you access
> > a networked file system, you get the filename encoding you are
> > given, which may or may not be the same as the particular locale
> > encoding on your particular machine on one particular day, and may
> > or may not be a unicode encoding.  Glib, for example, enables you
> > to set this with the G_FILENAME_ENCODING environmental variable
> > [...]
>
> which is, btw., "just a better approximation", but still wrong: the
> application creating a directory might have been "in" a different
> locale (and thus having a different encoding) than the one creating
> the file within that directory.
>
> Most notably, the whole path might cross several mount points, thus
> the whole path can well have fragments coming from several file
> systems.
>
> I think the only sane way to see a Linux file system path is the way
> Linux sees it: as a byte string.
>
> Sure, some helper infrastructure to try to make characters of that
> mess will be welcome, but that should be absolutely robust wrt.
> unexpected input (e.g. bad UTF-8) and leave control to the application.
>
> Not easy.

I don't disagree.  My purpose was to point out that in the modern world
of networking and plug-in devices, locales and filenames are disjoint.
The glib approach is better than assuming all filenames are in locale
encoding, but it is by no means perfect.

I came across exactly this problem when writing a small application,
mainly for my own use, to manage music files (actually mainly podcasts)
on a USB music stick.  The stick had its filenames in UTF-8 (somewhat
confusingly the text in its index files, which had UTF-8 names, was in
UTF-16).  This meant that if the computer on which the stick was
mounted used a different filename encoding, the full path of any file
on the stick could be in a mixed encoding: the mount point in the
computer's filename encoding and the remainder in UTF-8.  Because gio's
GFile insists that its filenames with path are in the encoding set by
G_FILENAME_ENCODING, this meant GFile was only guaranteed to work when
the stick was mounted on a computer with filename encoding set to
UTF-8.

In the end I just used the standard POSIX functions to open, close,
read and write files which, because Linux is codeset agnostic, worked
fine.  To display filenames in GTK+, I was able to apply
g_filename_to_utf8() to the mount point only and know that the
remainder of the file name was guaranteed to be in UTF-8 already.

> > g_filename_to_utf8() and g_filename_from_utf8() functions for this
> > purpose.
>
> To me, that seems insufficient, unless this just applies to one
> (e.g. the last) path element.  Skimming the docs I can't see whether
> you are only supposed to do that or whether you can dump whole paths
> (or path fragments) into those functions.

You can do whatever you want with these functions.  They just convert a
text fragment from filename encoding to UTF-8 (if different).  They are
the filename encoding equivalent of g_locale_to_utf8() and
g_locale_from_utf8() for the locale encoding.  If you pass them a
filename with path, and that is in a mixed encoding, it won't work.
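
To illustrate the mixed-encoding case, here is a minimal sketch in
Python rather than glib.  It is not the code from my application: the
function name path_to_display(), the example paths and the choice of
ISO-8859-1 for the host are all made up for illustration.  It assumes
the caller already knows the mount point as a byte string and knows
which encoding applies to each part.

```python
def path_to_display(raw_path: bytes, mount_point: bytes,
                    mount_encoding: str,
                    stick_encoding: str = "utf-8") -> str:
    """Decode a path whose mount-point prefix and remainder differ in encoding.

    Hypothetical helper: the prefix is decoded with the host's filename
    encoding, the remainder with the device's own encoding.
    """
    rest = raw_path[len(mount_point):]
    # The path must be the mount point itself or lie beneath it.
    if not raw_path.startswith(mount_point) or (rest and not rest.startswith(b"/")):
        raise ValueError("path is not under the given mount point")
    return mount_point.decode(mount_encoding) + rest.decode(stick_encoding)


# Assumed setup: the host names its mount point in ISO-8859-1,
# while the stick stores its own filenames in UTF-8.
mount = "/media/clé".encode("iso-8859-1")
raw = mount + "/podcasts/épisode.mp3".encode("utf-8")

# Converting the whole path with a single encoding does not work here.
try:
    raw.decode("utf-8")
    whole_path_ok = True
except UnicodeDecodeError:
    whole_path_ok = False

# Converting the two parts separately does.
display = path_to_display(raw, mount, "iso-8859-1")
```

The raw byte string is what gets passed to open() and friends; only
the decoded copy is used for display.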
There are variants which will gracefully degrade in case of encoding
errors - g_filename_display_name() and g_filename_display_basename().

[snip]
> It's moving between those two views that's hard.  Personally, I'd
> tend to have Guile being agnostic (i.e. byte arrays) at the lowest
> level (no conversions), and offer the application what it knows
> (on BSD or "modern" Windows say: "yes, that's UTF-8" and on Linux
> say "No idea, but you can try to convert").
>
> Current locale is just a weak hint one might use in heuristics.
> For things like environment variables and command line arguments,
> locale is a stronger hint (but not 100%).

I would prefer Guile to make the filename encoding a fluid.  It
wouldn't deal with files mounted with mixed encodings, but it would
cater for everything else.

Chris
