Re: [Bug-wget] bad filename

Tim Ruehsen Thu, 24 Apr 2014 18:54:12 -0700

On Thursday 24 April 2014 12:21:54 Andries E. Brouwer wrote:
> > I couldn't read that in your post before (I still can't). If Wget puts
> > "illegal" characters into filenames, that is a bug and has to be fixed.
> 
> Then let me clarify this point. Sorry for the length.


Andries, first of all, thanks for your exhaustive and well-written explanation.

> What wget does by default: bytes 0-0x1f and 0x7f-0x9f are considered
> "control" and escaped....

In fact, I overlooked the overlap between Wget's 'control' characters and
UTF-8, which is 0x80-0x9f.

So I simply missed this in your example:
> The bytes occurring here are d7 a9 2e 5f d7 a9 d7 a4 d7 a8 d7 94 (hex).
> The 0x94 at the end is considered control, and replaced...
> These names are not valid UTF-8 strings.
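
To make the effect concrete, here is a minimal sketch of the escaping step
applied to the byte sequence quoted above. It is a simplified model, not Wget's
actual code: the names is_control_byte and escape_control_bytes are
hypothetical, only the 0-0x1f and 0x80-0x9f ranges are modelled, and the %XX
replacement form is an assumption made for illustration.

```c
#include <stddef.h>

/* Simplified model of the "control" test discussed above (hypothetical
   helper, not taken from Wget): 0x00-0x1f and 0x80-0x9f count as control. */
static int is_control_byte(unsigned char c)
{
    return c <= 0x1f || (c >= 0x80 && c <= 0x9f);
}

/* Copy len bytes from src to dst, replacing each control byte with %XX
   (assumed replacement form). dst must hold at least 3 * len + 1 bytes.
   Returns the number of bytes written, not counting the trailing NUL. */
static size_t escape_control_bytes(const unsigned char *src, size_t len,
                                   char *dst)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        if (is_control_byte(src[i])) {
            dst[out++] = '%';
            dst[out++] = hex[src[i] >> 4];
            dst[out++] = hex[src[i] & 0x0f];
        } else {
            dst[out++] = (char)src[i];
        }
    }
    dst[out] = '\0';
    return out;
}
```

Fed the twelve bytes d7 a9 2e 5f d7 a9 d7 a4 d7 a8 d7 94, only the final 0x94
falls into the control range. The lead byte 0xd7 before it is copied through
unchanged, so it ends up followed by '%' instead of a continuation byte, and
the result is no longer valid UTF-8.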

> This behaviour of wget is very unfortunate, and people have been
> complaining for many years, but so far nobody took the trouble
> of fixing this. People not bitten by it consider it low priority
> and people bitten mostly live in China or Russia or other faraway
> places, and mostly do not mail bug-wget. Still, I found quite a few
> bug reports about this problem.

The only bug report I remember did not state a bug; it was more of a wish to
change Wget's default behavior. But maybe I had the same misunderstanding as
with your original post.

> By far the simplest fix is to change the default.
> That is the 1-word change true -> false in
>      opt.restrict_files_ctrl = true;
> 
> If people like this default (it is a bad default
> as I will argue below, but it is current practice)
> one can choose many fixes. One is to scan the filename,
> and if it is valid UTF-8 leave it unchanged.
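
The "scan the filename" option quoted above hinges on a UTF-8 validity check.
A minimal sketch of such a check follows; the function name is_valid_utf8 is
hypothetical, and how it would be wired into Wget's filename handling is left
open here.

```c
#include <stddef.h>

/* Return 1 if s[0..len) is well-formed UTF-8, 0 otherwise.
   Rejects stray continuation bytes, truncated sequences, overlong
   encodings, UTF-16 surrogates and code points above U+10FFFF. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */

        if (c < 0x80) {                 /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xe0) == 0xc0) {
            need = 1; cp = c & 0x1f; min = 0x80;
        } else if ((c & 0xf0) == 0xe0) {
            need = 2; cp = c & 0x0f; min = 0x800;
        } else if ((c & 0xf8) == 0xf0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or bad lead */
        }

        if (len - i - 1 < need)
            return 0;                   /* sequence is truncated */

        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xc0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3f);
        }

        if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return 0;                   /* overlong, too large, surrogate */

        i += need + 1;
    }
    return 1;
}
```

With a check like this, a name that passes could be written out untouched,
while a name that fails would still go through the existing escaping.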

I just want to mention my concerns about a quick and dirty solution, just so
that we think about them. (I am not the one to decide, and if it were my private
project, I would fix this bug immediately, no doubt.)

1. How do you know what filesystem you are writing to? If you assume the
user is not able to change the behavior, how should she be able to know about
filesystems? I am thinking of the FAT32 USB sticks flying around everywhere.
UTF-8 might be a problem there (see
http://en.wikipedia.org/wiki/Comparison_of_file_systems). I mention FAT32
only because it is pretty common; there might be other file systems with a
limited charset... A compile/configure-time option could be one solution.

2. Backward compatibility. Since the current Wget behavior has existed for a
long time now, there are definitely many workarounds (in the sense of 'relying
on current behavior') in production. Changing the default might break these
scripts/programs and may cause some damage.
Of course we can say it is the admin's responsibility to check each software
update before rolling it out to production, but I guess the reality is
different.

3. (Strictly speaking, another issue) If we touch the code, what about
--restrict-file-names=nocontrol,lowercase? Should we case-convert UTF-8?
My answer is yes (and that is what I did in the already mentioned Mget).

> Sometimes programs try to be helpful and change data for the user.
> This is always very unfortunate. In the old days ftp had the default
> "ascii" and did some conversion that destroyed all files one downloaded
> (probably compressed archives), and one had to throw the downloaded file
> away, and download again, this time not forgetting to add "binary".

Not only in the old days. It is still a problem, and I have stumbled over it
twice within the last 6 months.

> The desirable state of affairs is that programs designed to
> copy information do not modify it, unless explicitly asked.

Yes, definitely. But changing historic defaults should be carefully thought
through.

Tim
