[Bug-wget] escaped URLs and recursive retrieval

Bram Vandoren Mon, 21 Nov 2011 15:14:00 -0800

Hi,

I encountered a bug in wget that occurs with recursive retrieval: if a page contains 2 (or more) links:

<a href="http://example.com/~user/blah"> and
<a href="http://example.com/%7Euser/blah">

Both links point to the same page but the encoding is different. wget doesn't recognise this as the same page and downloads the page 'blah' twice. It also overwrites the first downloaded file. Also, if you specify the conversion option '-k', it only converts one of the two links.

I had a quick look at the source code. It can be solved by changing url_parse in url.c: call url_unescape before parsing the URL. This way you get the same parsed URL for both links. I am not sure if this is a good way to solve it. The conversion should probably be similar to the conversion that's done to determine the file name of the URL.
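
To illustrate the idea, here is a minimal sketch in Python. It is not wget's code, and the names normalize_escapes and seen are made up for this example. It assumes a more conservative rule than unescaping everything: only escapes of RFC 3986 "unreserved" characters (letters, digits, '-', '.', '_', '~') are decoded, so that an escaped '/' or '?' keeps its meaning.

```python
import re
import string

# Unreserved characters per RFC 3986; decoding their escapes does not
# change what the URL refers to.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(url: str) -> str:
    """Hypothetical helper: decode %XX for unreserved characters only,
    and uppercase the hex digits of every other escape."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return _ESCAPE.sub(fix, url)


# Use the normalized form as the "already visited" key during recursion.
seen = set()
for link in ("http://example.com/~user/blah",
             "http://example.com/%7Euser/blah"):
    key = normalize_escapes(link)
    if key in seen:
        print("skip duplicate:", link)
    else:
        seen.add(key)
        print("download:", link)
```

With a key like this, both links above map to the same entry, so the page would be fetched once and both links could be converted together.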


Kind regards,
Bram.
