Dear Eli and Tim,

First, I should say that my last two patches address different problems.
Next, to make it clear:

'Make url_file_name also convert remote path to local encoded' converts all characters from the URL (they come from the server, mostly in UTF-8) to the locale encoding (GBK, for example), and then appends them to the local path specified with '-P'. Otherwise, if we run iconv on a mixed-encoding string, an error occurs. Right? :)
It is for iconv.

'Fix printing mutibyte characters as unprintable characters on Windows' needs 'setlocale' to be called when 'ENABLE_NLS' is not defined on Windows, so that non-ASCII characters are displayed correctly in the console. :) As Eli said. Please refer to https://msdn.microsoft.com/en-us/library/x99tb11d.aspx.
It is for displaying in the console.

Best Regards,
YX Hao

> -----Original Message-----
> From: Eli Zaretskii [mailto:[email protected]]
> Sent: November 14, 2017 0:33
> To: Tim Rühsen <[email protected]>
> Cc: [email protected]; [email protected]
> Subject: Re: [Bug-wget] Patch: Make url_file_name also convert remote path to
> local encoded
>
> > Cc: [email protected], [email protected]
> > From: Tim Rühsen <[email protected]>
> > Date: Mon, 13 Nov 2017 16:36:39 +0100
> >
> > > I don't think it's a Gnulib issue. The problem is that on Windows,
> > > the implicit call at the beginning of Wget
> > >
> > > setlocale (LC_ALL, "C");
> >
> > Why is there an explicit call with "C" ? There is an explicit call with "".
>
> I said "implicit", not "explicit". Such an implicit call is made at the
> beginning of every C program, per ANSI C Standard. Right?
>
> The MSDN documentation says it clearly:
>
>   At program startup, the equivalent of the following statement is executed:
>
>     setlocale( LC_ALL, "C" );
>
> > From the man page:
> > "If locale is an empty string, "", each part of the locale that should
> > be modified is set according to the environment variables."
>
> The call with a locale of "" is only done in a build that has ENABLE_NLS
> defined. I was talking about a build which didn't define ENABLE_NLS.
>
> > > is not good enough to work in multibyte locales of the Far East,
> > > because the Windows runtime assumes a single-byte locale after that
> > > call. And since Wget happens to need to display text and create
> > > files with non-ASCII characters, it gets hit more than other programs.
> >
> > I (hopefully) can understand why this doesn't work. NTFS uses UTF-16
> > for the filenames. If your environment specifies a single-character
> > encoding (e.g. C) and we use at some point a multi-character encoding (e.g.
> > utf-8), then any automatic conversion to UTF-16 filenames are likely
> > to fail. For me the question is: a) does wget has a bug (e.g. creating
> > a filename with a wrong encoded name string or b) does the Windows API
> > has a problem.
> >
> > > The proposed solution is to add a special call to setlocale which
> > > gets this right on Windows.
> >
> > Why can't we just convert the filename string into the correct
> > encoding and then create the file ? What do I miss ?
>
> I guess you are missing a short introduction to the Windows l10n/i18n mess.
> Let me try.
>
> First, the fact that NTFS uses UTF-16 is not really relevant. Wget uses
> 'char *' strings, not 'wchar *' strings to store file names and call C
> library functions that accept file names. So we cannot use the
> UTF-16 encoding of non-ASCII file names directly. Instead, we use the
> locale's codepage (the C library and the OS APIs then convert to
> UTF-16 before hitting the disk, but that's not important now).
>
> Next, creating and opening file names is not the only problem: we need also to
> display these file names and URLs, and that also needs to use the encoding
> expected by the Windows console.
>
> Now, in any locale which uses single-byte encoding of non-ASCII characters,
> the C locale will support those characters, both for I/O and for functions
> like strcmp, strlen, strcoll, etc.
> But not in double-byte locales of the Far East: there, you
> must explicitly call setlocale with the correct codepage, to have the local
> character set supported. This support includes manipulating file names,
> calling C library functions to access files, and displaying non-ASCII text,
> such as file names and URLs, on the console.
>
> IOW, this is a Windows runtime subtlety that unfortunately needs to be fixed
> in the application code.
>
> (UTF-8 is not relevant at all here, because Windows doesn't support
> UTF-8 as the locale's codeset; if you try to call setlocale to set
> UTF-8 as the codeset, setlocale will simply fail. So if we have a
> UTF-8 encoded URL or file name inside wget, we must convert it to the current
> codepage by calling libiconv functions.)
>
> Does the above make sense? Let me know if I have to explain some more.
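
To illustrate the setlocale point discussed above, here is a minimal C sketch. It is not the patch itself and not Wget code: the helper names in_startup_c_locale and adopt_user_locale are hypothetical, chosen only for this example. It shows that a program starts in the "C" locale (the implicit call) and that an explicit call with "" is needed to pick up the user's locale, which on Windows means the system default ANSI codepage.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper (not from Wget): returns 1 while the program is
   still in the startup "C" locale.  Passing NULL only queries the
   current locale; it does not change it. */
int in_startup_c_locale(void)
{
    const char *cur = setlocale(LC_ALL, NULL);
    return cur != NULL && strcmp(cur, "C") == 0;
}

/* Hypothetical helper (not from Wget): switch to the user's default
   locale.  With "", the Windows runtime uses the default ANSI codepage
   obtained from the OS; POSIX systems consult the environment
   (LC_ALL, LANG, ...).  Returns 1 on success, 0 if setlocale() failed. */
int adopt_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}
```

A program would presumably call adopt_user_locale() once near the start of main(), before it prints or compares any non-ASCII file names or URLs. Whether the "" form or an explicit codepage string is the right choice for Wget on Windows is left open here.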
