Quoting Ivan <[email protected]>:

> > Folks,
> >
> > I wonder if we should implement some mechanism to support UTF-8 filenames
> > on windows (and generally) before GDAL 1.7 release?

That would definitely be a good idea. Apart from Windows, I'm not sure whether the other supported OSes need any work.

As far as Linux is concerned, I believe we can reasonably assume that UTF-8 strings are already passed to GDAL/OGR, since nowadays the distros have switched their default locales to UTF-8 (the move started with RedHat 8.0 in 2002), although my reading shows that the filesystem encoding is not necessarily the same as the locale encoding. I've looked a bit at the GLib 2.0 documentation: they have introduced the G_FILENAME_ENCODING and G_BROKEN_FILENAMES environment variables to deal with those rare situations (http://library.gnome.org/devel/glib/stable/glib-running.html, http://library.gnome.org/devel/glib/stable/glib-Character-Set-Conversion.html). All of this is rather confusing, but I don't think we need to go to that level of complexity.

As far as Mac OS X is concerned, I can't say.

> > How dangerous would it be for us to always assume filenames are UTF-8 and
> > act accordingly?
> >
> > One theoretical downside to treating filenames as UTF8 is that we do a lot
> > of filename parsing that has no concept that some bytes in the name might
> > be part of a multi-byte sequence. So if there was a UTF8 multibyte
> > character that happened to include ASCII 92 '\' or ASCII 47 '/' it would
> > confuse the path parsers. Also for subdatasets, database connections and
> > other esoteric datasource names we do a lot of parsing - splitting on
> > spaces, commas, quotes and other special characters. Some of this could be
> > confused by unfortunate UTF-8 characters. I suppose we really ought to
> > be migrating to doing these manipulations on wchar_t's or perhaps UCS-32
> > arrays.
> >
> > Hmm, this is getting rather complicated to address fully.

On the contrary, UTF-8 guarantees that a byte in the ASCII range (0-127) never appears inside a multi-byte UTF-8 character. Every byte of a multi-byte UTF-8 character has its most significant bit set to 1, so the existing byte-oriented parsing on '\', '/', spaces, commas and quotes cannot be confused by it.

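To make the byte-level argument concrete, here is a minimal sketch in C. It is not GDAL code: the function name last_separator is made up for illustration. It scans a UTF-8 path one byte at a time, with no knowledge of multi-byte sequences, and still finds only real separators, because every byte of a multi-byte character is 0x80 or above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of GDAL): return a pointer to the last
 * path separator ('/' or '\\') in a UTF-8 string, or NULL if there is none.
 * The scan is purely byte-oriented. That is safe for UTF-8 because lead
 * bytes are 0xC2..0xF4 and continuation bytes are 0x80..0xBF, so no byte
 * of a multi-byte character can compare equal to an ASCII separator. */
const char *last_separator(const char *utf8_path)
{
    const char *last = NULL;
    const char *p;

    assert(utf8_path != NULL);
    for (p = utf8_path; *p != '\0'; ++p) {
        if (*p == '/' || *p == '\\')
            last = p;
    }
    return last;
}
```

For example, calling it on the UTF-8 bytes of "data/été/carte.tif" returns a pointer to the second '/', so the base name that follows is "carte.tif". The same reasoning applies to splitting on spaces, commas or quotes.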
Quoting Wikipedia: "The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a small number of ASCII codes specially". So UTF-8 would definitely be a good choice of Unicode encoding.

> > But at least as a hack we could provide a build or runtime mechanism to
> > tell cpl_vsil_win32.cpp code that the passed in filename should be
> > handled as UTF-8 instead of local code page characters on windows. Would
> > that be worth implementing?

Like Ivan, I think we must aim for the cleanest solution (at least at the API level), to minimize the need for users to port their applications. I have hardly any experience of the subject on Windows, but I think we should target the wide-character (UTF-16) variants of the Windows API functions rather than the local code page, since a conversion from UTF-8/UTF-16 to the local code page encoding can fail.

Andrey mentioned CreateFileW in RFC5. _findfirst would likely need to be changed to _wfindfirst. You mention cpl_vsil_win32.cpp, but cpl_vsil_simple.cpp would probably need changes too: http://msdn.microsoft.com/en-us/library/yeby3zcb(VS.71).aspx mentions _wfopen(), a wide-character version of fopen.

On Windows, GDAL/OGR applications would also need some changes to get their command line options as UTF-8 arguments. I see the GetCommandLineW()/CommandLineToArgvW() functions for that (http://msdn.microsoft.com/en-us/library/ms683156(VS.85).aspx).

A remaining question is: should we provide a 'compatibility mode' for users who only deal with non-ASCII characters in the ANSI range of their local code page and can currently use them successfully? This could be controlled by an environment variable (CPL_ANSI_FILENAMES=ON) that would revert to the A variants without any string conversion.

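The wide-character approach and the compatibility switch could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual cpl_vsil code: open_utf8 and utf8_to_wide are names I made up, and CPL_ANSI_FILENAMES is just the variable proposed above, not an existing option. MultiByteToWideChar and _wfopen are the real Windows calls. On other platforms the sketch simply passes the UTF-8 bytes to fopen.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to a newly
 * allocated UTF-16 string. Returns NULL on invalid UTF-8 or out of memory. */
static wchar_t *utf8_to_wide(const char *utf8)
{
    wchar_t *wide;
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, NULL, 0);
    if (n <= 0)
        return NULL;
    wide = (wchar_t *)malloc((size_t)n * sizeof(wchar_t));
    if (wide != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, wide, n) == 0) {
        free(wide);
        wide = NULL;
    }
    return wide;
}

/* Assumption: the proposed CPL_ANSI_FILENAMES=ON switch selects the old
 * local code page behaviour. */
static int use_ansi_filenames(void)
{
    const char *value = getenv("CPL_ANSI_FILENAMES");
    return value != NULL && strcmp(value, "ON") == 0;
}
#endif

/* Hypothetical wrapper: open a file whose name is given in UTF-8. */
FILE *open_utf8(const char *utf8_path, const char *mode)
{
    assert(utf8_path != NULL && mode != NULL);
#ifdef _WIN32
    if (!use_ansi_filenames()) {
        /* Wide-character path: convert both arguments, then call _wfopen. */
        wchar_t *wpath = utf8_to_wide(utf8_path);
        wchar_t *wmode = utf8_to_wide(mode);
        FILE *fp = (wpath != NULL && wmode != NULL) ? _wfopen(wpath, wmode) : NULL;
        free(wpath);
        free(wmode);
        return fp;
    }
#endif
    /* Non-Windows, or compatibility mode: pass the bytes through unchanged. */
    return fopen(utf8_path, mode);
}
```

The point of the design is that callers always hand over UTF-8 and the conversion happens in one place, right before the OS call. The same pattern would apply to CreateFileW and _wfindfirst.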
Or maybe we can assume that the behaviour of current GDAL was undefined for any non-ASCII filename, so that we can freely define it now without worrying too much about backward compatibility.

Best regards,

Even

_______________________________________________
gdal-dev mailing list
[email protected]
http://lists.osgeo.org/mailman/listinfo/gdal-dev
