Hi Andries, as I already mentioned, changing the default behavior of wget is not a good idea.
But I started a wget2 branch that produces wget and wget2 executables. wget2's default behavior is to keep filenames as they are. I am not sure how it compiles and works on Windows (Cygwin might work). If you dare to check it out, any feedback is highly welcome.

Regards, Tim

On Thursday 06 August 2015 23:40:45 Andries E. Brouwer wrote:
> Today I again downloaded a large tree with wget and got only unusable
> filenames. Fortunately I have the utility wgetfix that repairs the
> consequences of this bug (see
> http://www.win.tue.nl/~aeb/linux/misc/wget.html ), but nevertheless this
> wget bug should be fixed.
>
> (Maybe it has been fixed already? I looked at this in detail last year,
> and there was some correspondence, but I think nothing happened.
> I have not looked at the latest sources.)
>
> What happens is that wget under certain circumstances escapes
> certain bytes in a filename. I think that this was always a mistake,
> but it did not occur very often and was defensible: filenames with
> embedded control characters are a pain.
>
> Today the situation is just the opposite: when copying from a remote
> UTF-8 system to a local UTF-8 system, correct and normal filenames
> are "escaped" into illegal filenames that cannot be used and are
> worse than a pain; one cannot do much else than discard them.
>
> What can the user do?
>
> If she is on Windows, she is told to switch to Linux:
> > I can't help Windows users, but Wget is a power-user tool.
> > And a Windows power-user should be able to start a virtual
> > machine with Linux running to use tools like Wget.
>
> If she is on Linux, the easiest is to discard all that was downloaded
> and start over again, this time with the option
> --restrict-file-names=nocontrol
>
> If the user knows about wgetfix, that is an alternative.
>
> One can also use curl instead of wget.
>
> See also
>
> http://savannah.gnu.org/bugs/?37564
> http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
> http://stackoverflow.com/questions/27054765/wget-japanese-characters
> http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget
> http://www.win.tue.nl/~aeb/linux/misc/wget.html
>
> Below I suggested an easy fix, and discussed some details.
>
> Andries
>
> On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote:
> > On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
> > > On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
> > > > If I ask wget to download the wikipedia page
> > > >
> > > > http://he.wikipedia.org/wiki/ש._שפרה
> > > >
> > > > then I hope for a resulting file ש._שפרה.
> > > > Instead, wget gives me ש._שפר\327%94, where the \327
> > > > is an unpronounceable byte that cannot be typed.
> > > > (This is a UTF-8 system and the filename
> > > > that wget produces is not valid UTF-8.)
> > > >
> > > > Maybe it would be better if wget by default used the original filename.
> > > > This name mangling is a vestige of old times, it seems to me.
> > >
> > > This is a commonly reported grievance and, as you correctly mention, a
> > > vestige of old times. With UTF-8 supported filesystems, Wget should
> > > simply write the correct characters.
> > >
> > > I sincerely hope this issue is resolved as fast as possible, but I
> > > know not how to. Those who understand i18n should work on this.
> >
> > It is very easy to resolve the issue, but I don't know how backwards
> > compatible the wget developers want to be.
> >
> > The easiest solution is to change the line (in init.c:defaults())
> >
> >     opt.restrict_files_ctrl = true;
> >
> > into
> >
> >     opt.restrict_files_ctrl = false;
> >
> > That is what I would like to see:
> > the default should be to preserve the name as-is,
> > and there should be an option "escape_control" or so
> > to force the current default behaviour.
> >
> > There are also more complicated solutions.
> > One can ask for LC_CTYPE or LANG or some such thing,
> > and try to find out whether the current system is UTF-8,
> > and only in that case set restrict_files_ctrl to false.
> >
> > I don't know anything about the Windows environment.
> >
> > Andries
> >
> > [Discussion:
> >
> > There is a flag --restrict-file-names. The manual page says:
> > "By default, Wget escapes the characters that are not valid or safe
> > as part of file names on your operating system, as well as control
> > characters that are typically unprintable."
> >
> > Presently this is false: on a UTF-8 system Wget by default introduces
> > illegal characters. The option "nocontrol" is needed to preserve the
> > correct name.
> >
> > The flag is handled in init.c:cmd_spec_restrict_file_names(),
> > where opt.restrict_files_{os,case,ctrl,nonascii} are set.
> > Of interest is the restrict_files_ctrl flag.
> > Today init.c does by default:
> >
> >     #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
> >       opt.restrict_files_os = restrict_windows;
> >     #else
> >       opt.restrict_files_os = restrict_unix;
> >     #endif
> >
> >     opt.restrict_files_ctrl = true;
> >     opt.restrict_files_nonascii = false;
> >     opt.restrict_files_case = restrict_no_case_restriction;
> >
> > The value of these flags is used in url.c:append_uri_pathel(),
> > where FILE_CHAR_TEST (*p, mask) is used to decide which bytes
> > in the filename need quoting.
> >
> > This is too simplistic an approach: quoting is introduced
> > in the middle of multibyte characters. So the current setup
> > is buggy and wrong. Basically the choice is between making
> > the unfortunately named "nocontrol" (it should be called
> > "preserve_name" or so) the default, and adding more heuristics
> > to detect and solve the worst problems. For example,
> > UTF-8 is easy to detect, so if a filename is valid UTF-8
> > one can preserve it.
> > Of course there are other multi-byte character sets
> > in widespread use in East Asia.]