Version: wget 1.8.1
Platform: Linux Mandrake 8.1
(both Mandrake packaged version and locally compiled vanilla version)

How to reproduce (example):
wget -o wget.debuglog -d -v \
    -r -np -p -x -N -nH \
    -R gz,zip,tgz,pdf,ps \
    http://www.w3.org/TR/xmlschema-0

There is an error in what I was doing: I refer to "/TR/xmlschema-0"
rather than "/TR/xmlschema-0/", but this highlights a possible problem
in wget, I think.

wget receives an HTTP 301 (Moved Permanently) response to the
"GET http://www.w3.org/TR/xmlschema-0", with a Location: of
"http://www.w3.org/TR/xmlschema-0/".  wget follows this redirection
and proceeds to fetch the page correctly (as TR/xmlschema-0/index.html).

The problem comes once it tries to decide which links to follow
recursively: it doesn't apply "-np" relative to the new URL, it uses
the original one.  This means that it views "http://www.w3.org/TR/" as
the root of documents that it is allowed to download, and proceeds to
get a little bit carried away, trying to download pretty much the whole
of the /TR/ area of www.w3.org, in fact.

I believe that wget should treat the redirect Location: as the root of
the download.  i.e., it should have restricted the download to
/TR/xmlschema-0/.  I can see complications if redirections occur lower
down the recursion, but that's another problem :-)
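
To illustrate the difference, here is a minimal Python sketch.  It is
not wget's actual code: the function names are made up for this
example, hosts are ignored for simplicity, and the sibling URL is only
an illustrative link under /TR/.  It shows how the directory taken as
the "-np" root changes depending on whether the original or the
redirected URL is used.

```python
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return the directory part of a URL path, always ending in '/'."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        return path
    # No trailing slash: the last component is treated as a file name.
    return posixpath.dirname(path).rstrip("/") + "/"


def allowed_with_no_parent(root_url, candidate_url):
    """True if candidate_url lies at or below the directory of root_url."""
    candidate_path = urlsplit(candidate_url).path or "/"
    return candidate_path.startswith(directory_of(root_url))


original = "http://www.w3.org/TR/xmlschema-0"     # URL given on the command line
redirected = "http://www.w3.org/TR/xmlschema-0/"  # Location: from the 301
sibling = "http://www.w3.org/TR/xmlschema-1/"     # illustrative link under /TR/

print(directory_of(original))                       # /TR/
print(directory_of(redirected))                     # /TR/xmlschema-0/
print(allowed_with_no_parent(original, sibling))    # True
print(allowed_with_no_parent(redirected, sibling))  # False
```

With the original URL as the root, anything under /TR/ passes the
check; with the redirected URL as the root, only /TR/xmlschema-0/ does.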

[I suspect (though I haven't got an example) that this behaviour would
also prevent a recursive download when a redirect moves outside the
original hierarchy (e.g., /TR/xmlschema-0/ to /NOT-TR/xmlschema-0/).]
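
The bracketed case can be sketched the same way.  Again this is only an
assumption about how a prefix-based "-np" check would behave, not
wget's code; the helper name, the /NOT-TR/ redirect target and the link
on the page are all hypothetical.

```python
from urllib.parse import urlsplit


def below(root_dir, url):
    """True if the path of url starts with root_dir (a simple prefix check)."""
    return (urlsplit(url).path or "/").startswith(root_dir)


# Root derived from the URL that was originally requested.
root_from_request = "/TR/xmlschema-0/"

# Hypothetical redirect target outside the original hierarchy.
redirect_target = "http://www.w3.org/NOT-TR/xmlschema-0/"
root_from_redirect = urlsplit(redirect_target).path

# Hypothetical link found on the redirected page.
link_on_page = "http://www.w3.org/NOT-TR/xmlschema-0/part2.html"

print(below(root_from_request, link_on_page))   # False: every link is rejected
print(below(root_from_redirect, link_on_page))  # True: recursion can continue
```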

phil, <[EMAIL PROTECTED]>


I've extracted bits from the log that are appropriate (a full log is
available if necessary).

---START OF LOG---
DEBUG output created by Wget 1.8.1 on linux-gnu.

Enqueuing http://www.w3.org/TR/xmlschema-0 at depth 0
Queue count 1, maxcount 1.
Dequeuing http://www.w3.org/TR/xmlschema-0 at depth 0
Queue count 0, maxcount 1.
--11:32:42--  http://www.w3.org/TR/xmlschema-0
           => `TR/xmlschema-0'
Resolving www.w3.org... done.
Caching www.w3.org => 193.51.208.68
Connecting to www.w3.org[193.51.208.68]:80... connected.
Created socket 4.
Releasing 0x807f8a0 (new refcount 1).
---request begin---
GET /TR/xmlschema-0 HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.w3.org
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 301 Moved Permanently
Date: Fri, 08 Feb 2002 11:32:54 GMT
Server: Apache/1.3.20 (Unix) PHP/3.0.18
Location: http://www.w3.org/TR/xmlschema-0/
Connection: close
Content-Type: text/html; charset=iso-8859-1

Location: http://www.w3.org/TR/xmlschema-0/ [following]
Closing fd 4
--11:32:43--  http://www.w3.org/TR/xmlschema-0/
           => `TR/xmlschema-0/index.html'
Found www.w3.org in host_name_addresses_map (0x807f8a0)
Connecting to www.w3.org[193.51.208.68]:80... connected.
Created socket 4.
Releasing 0x807f8a0 (new refcount 1).
---request begin---
GET /TR/xmlschema-0/ HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.w3.org
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 200 OK
Date: Fri, 08 Feb 2002 11:32:56 GMT
Server: Apache/1.3.20 (Unix) PHP/3.0.18
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Cache-Control: max-age=21600
Expires: Fri, 08 Feb 2002 17:32:56 GMT
Last-Modified: Tue, 01 May 2001 12:41:33 GMT
ETag: "5bc532-49c8a-3aeeaefd"
Accept-Ranges: bytes
Content-Length: 302218
Keep-Alive: timeout=15
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

Found www.w3.org in host_name_addresses_map (0x807f8a0)
Registered fd 4 for persistent reuse.
Length: 302,218 [text/html]

[... download and link analysis deleted ...]

no-follow in TR/xmlschema-0/index.html: 0
Deciding whether to enqueue "http://www.w3.org/StyleSheets/TR/W3C-REC".
Decided to load it.
Enqueuing http://www.w3.org/StyleSheets/TR/W3C-REC at depth 1
Queue count 1, maxcount 1.
Deciding whether to enqueue "http://www.w3.org/".
Going to "" would escape "TR" with no_parent on.
Decided NOT to load it.

[... lots more decisions ...]

Deciding whether to enqueue "http://www.w3.org/2001/03/XMLSchema/TypeLibrary.xsd".
Going to "2001/03/XMLSchema" would escape "TR" with no_parent on.
Decided NOT to load it.

[... more decisions and then downloading ...]

--- END ---

-- 
Phil Richards, <[EMAIL PROTECTED]>
