There was a case earlier today on the IRC channel that I'd like to bring up here. The user in question was attempting to continue a recursive retrieval. The files being downloaded were large binaries. However, Wget still loads files that have already been downloaded in an attempt to find new links. Below is the debug output that the user shared:
Dequeuing https://marius:[email protected]/remote/path/name/remote_file_name at depth 2
Queue count 4, maxcount 19.
--2014-03-22 12:15:16--  https://marius:*password*@remote.host.name/remote/path/name/remote_file_name
Found remote.host.name in host_name_addresses_map (0x80fe790)
Connecting to remote.host.name (remote.host.name)|1.2.3.4|:443... connected.
Created socket 3.
Releasing 0x080fe790 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 3 to SSL handle 0x08101418
certificate:
  subject: /CN=localhost.localdomain
  issuer:  /CN=localhost.localdomain
WARNING: cannot verify remote.host.name's certificate, issued by `/CN=localhost.localdomain':
  Self-signed certificate encountered.
WARNING: certificate common name `localhost.localdomain' doesn't match requested host name `remote.host.name'.

---request begin---
HEAD /remote/path/name/remote_file_name HTTP/1.1
Referer: https://remote.host.name/remote/path/name/
Range: bytes=1776617195-
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: remote.host.name
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 401 Authorization Required
Date: Sat, 22 Mar 2014 11:15:17 GMT
Server: Apache/2.2.14 (Ubuntu)
WWW-Authenticate: Basic realm="Identify yourself"
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

---response end---
401 Authorization Required
Registered socket 3 for persistent reuse.
Disabling further reuse of socket 3.
Closed 3/SSL 0x08101418
Found remote.host.name in host_name_addresses_map (0x80fe790)
Connecting to remote.host.name (remote.host.name)|1.2.3.4|:443... connected.
Created socket 3.
Releasing 0x080fe790 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 3 to SSL handle 0x08101418
certificate:
  subject: /CN=localhost.localdomain
  issuer:  /CN=localhost.localdomain
WARNING: cannot verify remote.host.name's certificate, issued by `/CN=localhost.localdomain':
  Self-signed certificate encountered.
WARNING: certificate common name `localhost.localdomain' doesn't match requested host name `remote.host.name'.

---request begin---
HEAD /remote/path/name/remote_file_name HTTP/1.1
Referer: https://remote.host.name/remote/path/name/
Range: bytes=1776617195-
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: remote.host.name
Connection: Keep-Alive
Authorization: Basic bWFyaXVzOnJlbWxvdHVz

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 416 Requested Range Not Satisfiable
Date: Sat, 22 Mar 2014 11:15:17 GMT
Server: Apache/2.2.14 (Ubuntu)
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

---response end---
416 Requested Range Not Satisfiable
Registered socket 3 for persistent reuse.
URI content encoding = `iso-8859-1'
The file is already fully retrieved; nothing to do.
Disabling further reuse of socket 3.
Closed 3/SSL 0x08101418
Loaded /local/path/name/local_file_name (size 1776617195).
no-follow in /local/path/name/local_file_name: 0

As you can see, Wget receives only an HTTP 416 response with Content-Type text/html, but it still loads the complete 2 GB file into memory, looking for links. Since Wget does not know the file type at this point, I agree that this might be the right thing to do, but section 7.2.1 of RFC 2616 says: "Any HTTP/1.1 message containing an entity-body SHOULD include a Content-Type header field defining the media type of that body. If and only if the media type is not given by a Content-Type field, the recipient MAY attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource.
If the media type remains unknown, the recipient SHOULD treat it as type "application/octet-stream"." Hence, Wget's behaviour seems to go against what the specification mandates. However, I understand that when continuing a recursive retrieval we may want to scan all existing files too. Perhaps Wget could write a simple flat file with the relevant details in case it is aborted? That way it would know which files it *should* parse and which ones it shouldn't. The user reporting this issue found that Wget would block for almost 30 seconds on each previously downloaded file while loading it into memory, yet it simply skipped over newly downloaded files, which suggests the server did indeed send the right Content-Type headers with its HTTP 200 responses. I'm looking for comments and opinions on how Wget should handle such corner cases. -- Thanking You, Darshit Shah
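P.S. To make the flat-file idea concrete, here is a minimal sketch in Python (nothing here is part of Wget; the manifest name, record format, and helper names are all hypothetical). The idea: when a download completes, append one "local path <TAB> Content-Type" record; when resuming, re-parse only files recorded as HTML, treating unrecorded files as application/octet-stream per the RFC 2616 default.

```python
import os

# Hypothetical side file written next to the download tree.
MANIFEST = ".wget-manifest"

def record_download(manifest_path, local_path, content_type):
    """Append one 'path<TAB>content-type' record when a file finishes."""
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write("%s\t%s\n" % (local_path, content_type))

def should_parse_for_links(manifest_path, local_path):
    """Decide, on resume, whether a pre-existing file is worth scanning
    for links.  Files with no recorded media type default to
    application/octet-stream (the RFC 2616 section 7.2.1 fallback),
    which is never parsed."""
    content_type = "application/octet-stream"
    try:
        with open(manifest_path, encoding="utf-8") as f:
            for line in f:
                path, _, ctype = line.rstrip("\n").partition("\t")
                if path == local_path:
                    content_type = ctype  # last record wins
    except FileNotFoundError:
        pass
    # Strip any ';charset=...' parameter before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

With such a manifest, a continued recursive retrieval could skip the multi-gigabyte binaries entirely instead of loading each one into memory to look for links.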
