There was a case earlier today on the IRC channel that I'd like to bring up here. The user in question was attempting to continue a recursive retrieval. The files being downloaded were large binaries. However, Wget still loads files that have already been downloaded in an attempt to find new links. Below is the debug output that the user shared:
Dequeuing https://marius:[email protected]/remote/path/name/remote_file_name at depth 2
Queue count 4, maxcount 19.
--2014-03-22 12:15:16--  https://marius:*password*@remote.host.name/remote/path/name/remote_file_name
Found remote.host.name in host_name_addresses_map (0x80fe790)
Connecting to remote.host.name (remote.host.name)|1.2.3.4|:443... connected.
Created socket 3.
Releasing 0x080fe790 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 3 to SSL handle 0x08101418
certificate:
  subject: /CN=localhost.localdomain
  issuer:  /CN=localhost.localdomain
WARNING: cannot verify remote.host.name's certificate, issued by `/CN=localhost.localdomain':
  Self-signed certificate encountered.
WARNING: certificate common name `localhost.localdomain' doesn't match requested host name `remote.host.name'.

---request begin---
HEAD /remote/path/name/remote_file_name HTTP/1.1
Referer: https://remote.host.name/remote/path/name/
Range: bytes=1776617195-
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: remote.host.name
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 401 Authorization Required
Date: Sat, 22 Mar 2014 11:15:17 GMT
Server: Apache/2.2.14 (Ubuntu)
WWW-Authenticate: Basic realm="Identify yourself"
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

---response end---
401 Authorization Required
Registered socket 3 for persistent reuse.
Disabling further reuse of socket 3.
Closed 3/SSL 0x08101418
Found remote.host.name in host_name_addresses_map (0x80fe790)
Connecting to remote.host.name (remote.host.name)|1.2.3.4|:443... connected.
Created socket 3.
Releasing 0x080fe790 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 3 to SSL handle 0x08101418
certificate:
  subject: /CN=localhost.localdomain
  issuer:  /CN=localhost.localdomain
WARNING: cannot verify remote.host.name's certificate, issued by `/CN=localhost.localdomain':
  Self-signed certificate encountered.
WARNING: certificate common name `localhost.localdomain' doesn't match requested host name `remote.host.name'.

---request begin---
HEAD /remote/path/name/remote_file_name HTTP/1.1
Referer: https://remote.host.name/remote/path/name/
Range: bytes=1776617195-
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: remote.host.name
Connection: Keep-Alive
Authorization: Basic bWFyaXVzOnJlbWxvdHVz

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 416 Requested Range Not Satisfiable
Date: Sat, 22 Mar 2014 11:15:17 GMT
Server: Apache/2.2.14 (Ubuntu)
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

---response end---
416 Requested Range Not Satisfiable
Registered socket 3 for persistent reuse.
URI content encoding = `iso-8859-1'
The file is already fully retrieved; nothing to do.
Disabling further reuse of socket 3.
Closed 3/SSL 0x08101418
Loaded /local/path/name/local_file_name (size 1776617195).
no-follow in /local/path/name/local_file_name: 0

As you can see, Wget receives only an HTTP 416 response with Content-Type text/html, but it still loads the complete 2 GB file into memory, looking for links. Since Wget does not know the file type at this point, I agree that this might be the right thing to do, but section 7.2.1 of RFC 2616 says: "Any HTTP/1.1 message containing an entity-body SHOULD include a Content-Type header field defining the media type of that body. If and only if the media type is not given by a Content-Type field, the recipient MAY attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource.
If the media type remains unknown, the recipient SHOULD treat it as type "application/octet-stream"." Hence, Wget's behaviour seems to go against what the specification mandates. However, I understand that when continuing a recursive retrieval we may want to scan all existing files too. Perhaps Wget could write a simple flat file with the relevant details in case it is aborted? That way it would know which files it *should* parse and which ones it shouldn't. The user reporting this issue found that Wget would block for almost 30 seconds on each previously downloaded file while loading it into memory, yet it simply skipped over newly downloaded files, which suggests the server did indeed send the right Content-Type headers with its HTTP 200 responses. I'm looking for comments and opinions on how Wget should handle such corner cases. -- Thanking You, Darshit Shah
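P.S. To make the flat-file idea concrete, here is a minimal sketch in Python (nothing here is part of Wget; the manifest name, record format, and helper names are all hypothetical). The idea: when a download completes, append one "local path <TAB> Content-Type" record; when resuming, re-parse only files recorded as HTML, treating unrecorded files as application/octet-stream per the RFC 2616 default.

```python
import os

# Hypothetical side file written next to the download tree.
MANIFEST = ".wget-manifest"

def record_download(manifest_path, local_path, content_type):
    """Append one 'path<TAB>content-type' record when a file finishes."""
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write("%s\t%s\n" % (local_path, content_type))

def should_parse_for_links(manifest_path, local_path):
    """Decide, on resume, whether a pre-existing file is worth scanning
    for links.  Files with no recorded media type default to
    application/octet-stream (the RFC 2616 section 7.2.1 fallback),
    which is never parsed."""
    content_type = "application/octet-stream"
    try:
        with open(manifest_path, encoding="utf-8") as f:
            for line in f:
                path, _, ctype = line.rstrip("\n").partition("\t")
                if path == local_path:
                    content_type = ctype  # last record wins
    except FileNotFoundError:
        pass
    # Strip any ';charset=...' parameter before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

With such a manifest, a continued recursive retrieval could skip the multi-gigabyte binaries entirely instead of loading each one into memory to look for links.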
