> Date: Mon, 17 Aug 2015 12:59:05 +0200
> From: "Andries E. Brouwer" <[email protected]>
> Cc: "Andries E. Brouwer" <[email protected]>, [email protected],
> [email protected]
>
> On Mon, Aug 17, 2015 at 05:39:34AM +0300, Eli Zaretskii wrote:
>
> (i) [about using setlocale]
>
> > > > First, relying on UTF-8 locale to be announced in the environment
> > > > is less portable than it could be: it's better to call 'setlocale'
> > > > Then ... at least Cygwin will not be excluded from this feature.
> > >
> > > I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
> > > because I do not know anything about these platforms.
> >
> > These systems don't normally have the LC_* environment
> > variables, and their 'setlocale' (with the exception of Cygwin) does
> > not look at those variables.  But you _can_ obtain the current locale
> > on all supported systems by calling 'setlocale'.
>
> Good. Then perhaps using setlocale would be better.
>
> I will not do so - do not feel confident on the Windows platform.
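To make the quoted suggestion concrete -- "you _can_ obtain the current
locale on all supported systems by calling 'setlocale'" -- here is a
minimal sketch of the idea.  It is only an illustration, not code from
wget or from the patch under discussion; it merely shows asking the C
library instead of reading LC_*/LANG from the environment ourselves:

  /* Sketch: ask the C library for the current locale.  On Windows,
     'setlocale' does not consult the LC_* environment variables, but
     this still reports the system's current locale.  */
  #include <locale.h>
  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    /* Switch LC_CTYPE to the user's (or system's) default locale.  */
    setlocale (LC_CTYPE, "");

    /* Query the locale that is now in effect.  */
    const char *loc = setlocale (LC_CTYPE, NULL);
    if (loc == NULL)
      loc = "C";
    printf ("current LC_CTYPE locale: %s\n", loc);

    /* A crude hint only; deciding what to escape should really look
       at the codeset, i.e. nl_langinfo (CODESET), see below.  */
    if (strstr (loc, "UTF-8") || strstr (loc, "utf8"))
      printf ("looks like a UTF-8 locale\n");
    return 0;
  }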
You don't need to -- do it on your OS, and the same will work
elsewhere.

> After all, the goal is not to find out what locale we are in,
> but to find out whether it might be a good idea to escape certain
> bytes in a filename.

Indeed, you want the current locale's codeset, see below.

> On Windows I guess that FAT filesystems will use some code page,
> and NTFS filesystems will use Unicode.

Not exactly.  The functions that emulate Posix and accept file names
as "char *" strings cannot use Unicode on Windows, because using
Unicode means using wchar_t strings instead.  So, unless Someone™
changes wget to do that, at least on Windows, the Windows port will
still use the current system codepage, even on NTFS, because that's
what functions like 'fopen', 'open', 'stat', etc. assume.

> (ii) [about possibly using iconv]
>
> >> How do you guess the original character set?
>
> Since you pass silently over this point

No, I just missed that, sorry.  The answer is to call
"nl_langinfo (CODESET)".  Windows doesn't have 'nl_langinfo', but it
is easily emulated with more or less a single API call, or we could
use the Gnulib replacement (which already does support Windows).

> it seems there is no good way to involve iconv.

Actually, there's no problem, see above.  Many programs do it like
that already.

> > This is a philosophical question: is a Cyrillic file name encoded in
> > koi8-r and the same name encoded in UTF-8 a "modified data" or the
> > same data expressed in different codesets.
>
> Unix filenames are not necessarily in any particular character set.
> They are sequences of bytes different from NUL and '/'.
> A different sequence of bytes is a different filename.

As long as you treat them as UTF-8 encoded strings, they are, for all
practical purposes, in the Unicode character set.  (Which, btw, brings
up the question what to do if the UTF-8 sequence is for u+FFFD or is
simply invalid -- do we treat them as control characters or don't we?)

> Also, "the same name encoded in UTF-8" is an optimistic description.
> Should the Unicode be NFC? Or NFD? MacOS has a third version.

It doesn't matter, since any filesystem worth its sectors will DTRT
and any ls-like program will, too, and will show you a perfectly
legible file name.

> Even if the filename had a well-defined and known character set,
> conversion to UTF-8 is not uniquely defined.

Do whatever iconv does, and we will be fine.
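To illustrate the pipeline discussed above -- get the codeset with
nl_langinfo (CODESET), then let iconv do the conversion to UTF-8 --
something along these lines would do.  A minimal sketch only, with a
made-up function name and simplistic error handling; it assumes
setlocale (LC_CTYPE, "") has already been called, and it is not code
from wget or from the patch:

  #include <iconv.h>
  #include <langinfo.h>
  #include <stdlib.h>
  #include <string.h>

  /* Convert NAME from the current locale's codeset to UTF-8.
     Returns a malloc'ed string, or NULL if iconv has no converter
     for this codeset or the input is not valid in it.  */
  char *
  filename_to_utf8 (const char *name)
  {
    const char *codeset = nl_langinfo (CODESET);   /* e.g. "KOI8-R" */
    iconv_t cd = iconv_open ("UTF-8", codeset);
    if (cd == (iconv_t) -1)
      return NULL;

    size_t inleft = strlen (name);
    size_t outsize = 4 * inleft + 1;       /* UTF-8 worst-case growth */
    char *out = malloc (outsize);
    if (out == NULL)
      {
        iconv_close (cd);
        return NULL;
      }

    char *inp = (char *) name;             /* iconv wants char ** */
    char *outp = out;
    size_t outleft = outsize - 1;

    if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
      {
        free (out);
        iconv_close (cd);
        return NULL;
      }

    *outp = '\0';
    iconv_close (cd);
    return out;
  }

A NULL return is then the natural place to fall back to escaping the
raw bytes of the name.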

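And regarding the aside above about UTF-8 sequences that are "simply
invalid": detecting them is mechanical.  Here is a sketch of a strict
well-formedness check (it rejects overlong forms, surrogates, and
anything above U+10FFFF), purely as an illustration; none of this is
wget code:

  #include <stdbool.h>

  /* Return true if S is well-formed UTF-8.  Malformed names could
     then be escaped byte-by-byte instead of being trusted.  */
  bool
  utf8_is_valid (const unsigned char *s)
  {
    while (*s)
      {
        if (s[0] < 0x80)                           /* ASCII */
          s += 1;
        else if (s[0] >= 0xC2 && s[0] <= 0xDF      /* 2-byte */
                 && (s[1] & 0xC0) == 0x80)
          s += 2;
        else if (s[0] == 0xE0                      /* 3-byte, no overlongs */
                 && s[1] >= 0xA0 && s[1] <= 0xBF
                 && (s[2] & 0xC0) == 0x80)
          s += 3;
        else if (((s[0] >= 0xE1 && s[0] <= 0xEC)
                  || s[0] == 0xEE || s[0] == 0xEF)
                 && (s[1] & 0xC0) == 0x80
                 && (s[2] & 0xC0) == 0x80)
          s += 3;
        else if (s[0] == 0xED                      /* no surrogates */
                 && s[1] >= 0x80 && s[1] <= 0x9F
                 && (s[2] & 0xC0) == 0x80)
          s += 3;
        else if (s[0] == 0xF0                      /* 4-byte, no overlongs */
                 && s[1] >= 0x90 && s[1] <= 0xBF
                 && (s[2] & 0xC0) == 0x80
                 && (s[3] & 0xC0) == 0x80)
          s += 4;
        else if (s[0] >= 0xF1 && s[0] <= 0xF3
                 && (s[1] & 0xC0) == 0x80
                 && (s[2] & 0xC0) == 0x80
                 && (s[3] & 0xC0) == 0x80)
          s += 4;
        else if (s[0] == 0xF4                      /* <= U+10FFFF */
                 && s[1] >= 0x80 && s[1] <= 0x8F
                 && (s[2] & 0xC0) == 0x80
                 && (s[3] & 0xC0) == 0x80)
          s += 4;
        else
          return false;
      }
    return true;
  }

Whether such names then get the control-character treatment or
something else is the policy question above; the check itself is the
easy part.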