On Thursday 24 April 2014 12:21:54 Andries E. Brouwer wrote:
> > I couldn't read that in your post before (I still can't). If Wget puts
> > "illegal" characters into filenames, that is a bug and has to be fixed.
>
> Then let me clarify this point. Sorry for the length.

Andries, first of all, thanks for your exhaustive and well-written explanation.

> What wget does by default: bytes 0-0x1f and 0x7f-0x9f are considered
> "control" and escaped....

In fact, I overlooked the intersection of Wget's 'control' characters and
UTF-8, which is 0x80-0x9f. So I simply missed this in your example:

> The bytes occurring here are d7 a9 2e 5f d7 a9 d7 a4 d7 a8 d7 94 (hex).
> The 0x94 at the end is considered control, and replaced...
> These names are not valid UTF-8 strings.
> This behaviour of wget is very unfortunate, and people have been
> complaining for many years, but so far nobody took the trouble
> of fixing this. People not bitten by it consider it low priority
> and people bitten mostly live in China or Russia or other faraway
> places, and mostly do not mail bug-wget. Still, I found quite a few
> bug reports about this problem.

The only bug report I remember did not state a bug; it was more of a wish to
change Wget's default behavior. But maybe I had the same misunderstanding as
with your original post.

> By far the simplest fix is to change the default.
> That is the 1-word change true -> false in
> opt.restrict_files_ctrl = true;
>
> If people like this default (it is a bad default
> as I will argue below, but it is current practice)
> one can choose many fixes. One is to scan the filename,
> and if it is valid UTF-8 leave it unchanged.

I just want to mention my concerns about a quick and dirty solution, so that
we think about it. (I am not the one to decide, and if it were my private
project, I would fix this bug immediately, no doubt.)

1. How do you know what filesystem you are writing to? If you assume the user
is not able to change the behavior, how should she be able to know about
filesystems? I am thinking of these FAT32 USB sticks flying around everywhere.
UTF-8 might be a problem
(see http://en.wikipedia.org/wiki/Comparison_of_file_systems).
I just mention FAT32 because it is pretty common.
There might be other file systems with a limited charset... A compile/configure
time option could be one solution.

2. Backward compatibility. Since the current Wget behavior has existed for a
long time now, there are definitely many work-arounds (in the sense of
'relying on current behavior') in production. Changing the default might blow
up these scripts/programs and may cause some damage. Of course we can say it
is the admin's responsibility to check each software update before rolling it
out to production, but I guess the reality is different.

3. (Strictly another issue) If we touch the code, what about
--restrict-file-names=nocontrol,lowercase ? Should we case-convert UTF-8 ?
My answer is yes (and that is what I did in the already mentioned Mget).

> Sometimes programs try to be helpful and change data for the user.
> This is always very unfortunate. In the old days ftp had the default
> "ascii" and did some conversion that destroyed all files one downloaded
> (probably compressed archives), and one had to throw the downloaded file
> away, and download again, this time not forgetting to add "binary".

Not only in the old days. It is still a problem, and I stumbled over it twice
within the last 6 months.

> The desirable state of affairs is that programs designed to
> copy information do not modify it, unless explicitly asked.

Yes, definitely. But changing historic defaults should be carefully thought
through.

Tim
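
For illustration only, here is a minimal sketch in C of the "scan the filename,
and if it is valid UTF-8 leave it unchanged" idea quoted above. It is not
Wget's code: the function names (is_valid_utf8, is_control_byte,
sanitize_filename) are hypothetical, and the %XX escape format is an
assumption made for the example. It does not address the filesystem or
backward-compatibility concerns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from Wget): true if s[0..len) is well-formed
   UTF-8. Rejects overlong forms, surrogates, and values above U+10FFFF. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;           /* continuation bytes expected after c */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xe0) == 0xc0) { n = 1; cp = c & 0x1f; min = 0x80; }
        else if ((c & 0xf0) == 0xe0) { n = 2; cp = c & 0x0f; min = 0x800; }
        else if ((c & 0xf8) == 0xf0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return false;  /* stray continuation or invalid lead byte */

        if (len - i <= n) return false;                  /* truncated */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xc0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3f);
        }
        if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return false;
        i += n + 1;
    }
    return true;
}

/* The byte ranges described in the thread as "control". */
static bool is_control_byte(unsigned char c)
{
    return c <= 0x1f || (c >= 0x7f && c <= 0x9f);
}

/* Hypothetical sketch: valid UTF-8 names are copied unchanged; otherwise
   each "control" byte becomes %XX (assumed escape format).
   out must have room for 3 * len + 1 bytes. */
static void sanitize_filename(const unsigned char *name, size_t len, char *out)
{
    assert(name != NULL && out != NULL);
    if (is_valid_utf8(name, len)) {
        memcpy(out, name, len);
        out[len] = '\0';
        return;
    }
    char *p = out;
    for (size_t i = 0; i < len; i++) {
        if (is_control_byte(name[i]))
            p += sprintf(p, "%%%02X", (unsigned)name[i]);
        else
            *p++ = (char)name[i];
    }
    *p = '\0';
}
```

With the bytes from the example (d7 a9 2e 5f d7 a9 d7 a4 d7 a8 d7 94), the
name is valid UTF-8, so this sketch copies it unchanged. Escaping the final
0x94 unconditionally would leave the lead byte 0xd7 without its continuation
byte, which is why the resulting name is no longer valid UTF-8.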
