[Bug-wget] wget fails to encode spaces in URLs

2011-06-05 Thread Volker Kuhlmann
  wget --version
GNU Wget 1.12 built on linux-gnu.

To reproduce:

Go to any SourceForge project and download a file whose URL contains a
space. Copy the direct link from the download page into wget -i-

Run wireshark and press ^D in the wget input stream.

If the upstream strips spaces (e.g. squid, default setting in pfsense)
the download goes round in circles.
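
For illustration, here is a minimal Python sketch (not wget code) of why an
unencoded space is a problem on the wire. An HTTP request line consists of
three space-separated fields, so a raw space in the path adds a fourth. The
path and the helper name request_line are made up for this example.

```python
def request_line(path, method="GET", version="HTTP/1.0"):
    """Build an HTTP request line without any escaping (hypothetical helper)."""
    return f"{method} {path} {version}"


# Path with a raw space, as it would be sent if the URL is never encoded.
raw = request_line("/files/my file.zip")

# The same path with the space percent-encoded.
encoded = request_line("/files/my%20file.zip")

# A server or proxy splitting on spaces sees four fields instead of three
# for the raw version, so it has to guess where the path ends.
print(raw.split(" "))
print(encoded.split(" "))
```

How an intermediary reacts to the malformed line (reject, truncate, strip
the space) is up to that intermediary; the sketch only shows the ambiguity.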

The bug does not exist in wget when passing the URL on the command line.
I always use -i- because of all the shell crud in URLs.

I am using the openSUSE 11.4 version, but the only source code change is
additional support for libproxy.


Problem:

Looking at the source, in main.c url_parse() is called for each URL from
the command line. For -i, it calls retrieve_from_file().

retrieve_from_file() (in retr.c) reads a list of URLs from the given
file. It then calls url_parse() only if IRI is enabled (which in my
version of wget is not even compiled in).
Hence the URL is never parsed and never encoded before being downloaded
with retrieve_url().
That's a bug.

The fix is probably to always call url_parse() in retrieve_from_file(),
and not only when IRI is turned on.
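
As a rough model of the behaviour described above, the following Python
sketch contrasts the two paths. It is not wget's code: encode_url,
urls_from_file_buggy, urls_from_file_fixed and the SAFE character set are
all assumptions for the example, with urllib.parse.quote standing in for
the encoding that url_parse() would perform.

```python
from urllib.parse import quote

# Reserved characters plus "%" are left alone so that the URL structure and
# any existing escapes survive. This set is an assumption for the sketch.
SAFE = "/:?#[]@!$&'()*+,;=%"


def encode_url(url):
    """Percent-encode unsafe characters such as spaces (stand-in for parsing)."""
    return quote(url, safe=SAFE)


def urls_from_file_buggy(lines, iri_enabled=False):
    """Model of the reported behaviour: encode only when IRI is enabled."""
    return [encode_url(u) if iri_enabled else u for u in lines]


def urls_from_file_fixed(lines):
    """Model of the proposed fix: always encode, regardless of IRI."""
    return [encode_url(u) for u in lines]


urls = ["http://example.com/files/my file.zip"]
print(urls_from_file_buggy(urls))  # space is passed through untouched
print(urls_from_file_fixed(urls))  # space becomes %20
```

Keeping "%" in the safe set means a URL that is already encoded is not
encoded a second time.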


If this goes to a mailing list, please cc me on replies, I am not
subscribed.

Thanks,

Volker

-- 
Volker Kuhlmann
http://volker.dnsalias.net/



Re: [Bug-wget] wget fails to encode spaces in URLs

2011-06-05 Thread Giuseppe Scrivano
Hi Volker,

thanks for reporting this bug. It has already been fixed in the development
version of wget, and the fix will be included in the next release.

Can you please confirm whether it works for you?

You can fetch a source tarball here:
  ftp://alpha.gnu.org/gnu/wget/wget-1.12-2504.tar.bz2

Thanks,
Giuseppe


