Version: wget 1.8.1
Platform: Linux Mandrake 8.1 (both the Mandrake packaged version and a locally compiled vanilla version)

How to reproduce (example):

    wget -o wget.debuglog -d -v \
         -r -np -p -x -N -nH \
         -R gz,zip,tgz,pdf,ps \
         http://www.w3.org/TR/xmlschema-0

There is an error in what I was doing: I referred to "/TR/xmlschema-0" rather than "/TR/xmlschema-0/". But I think this highlights a possible problem in wget.

wget receives an HTTP 301 (Moved Permanently) response to the "GET http://www.w3.org/TR/xmlschema-0" request, with a Location: of "http://www.w3.org/TR/xmlschema-0/". wget follows this redirection and fetches the page correctly (as TR/xmlschema-0/index.html).

The problem comes when it decides which links to follow recursively: it applies "-np" relative to the original URL, not the new one. This means that it treats "http://www.w3.org/TR/" as the root of the documents it is allowed to download, and gets a little bit carried away: it tries to download pretty much the whole of the /TR/ area of www.w3.org.

I believe that wget should treat the redirect Location: as the root of the download, i.e., it should have restricted the download to /TR/xmlschema-0/. I can see complications if redirections occur lower down the recursion, but that's another problem :-)

[I suspect (though I haven't got an example) that this behaviour would also prevent a recursive download in the case where a redirect moved outside the original hierarchy (e.g., /TR/xmlschema-0/ to /NOT-TR/xmlschema-0/).]

phil, <[EMAIL PROTECTED]>

I've extracted the relevant bits from the log (a full log is available if necessary).

---START OF LOG---
DEBUG output created by Wget 1.8.1 on linux-gnu.

Enqueuing http://www.w3.org/TR/xmlschema-0 at depth 0
Queue count 1, maxcount 1.
Dequeuing http://www.w3.org/TR/xmlschema-0 at depth 0
Queue count 0, maxcount 1.
--11:32:42--  http://www.w3.org/TR/xmlschema-0
           => `TR/xmlschema-0'
Resolving www.w3.org... done.
Caching www.w3.org => 193.51.208.68
Connecting to www.w3.org[193.51.208.68]:80... connected.
Created socket 4.
Releasing 0x807f8a0 (new refcount 1).
---request begin---
GET /TR/xmlschema-0 HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.w3.org
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 301 Moved Permanently
Date: Fri, 08 Feb 2002 11:32:54 GMT
Server: Apache/1.3.20 (Unix) PHP/3.0.18
Location: http://www.w3.org/TR/xmlschema-0/
Connection: close
Content-Type: text/html; charset=iso-8859-1

Location: http://www.w3.org/TR/xmlschema-0/ [following]
Closing fd 4
--11:32:43--  http://www.w3.org/TR/xmlschema-0/
           => `TR/xmlschema-0/index.html'
Found www.w3.org in host_name_addresses_map (0x807f8a0)
Connecting to www.w3.org[193.51.208.68]:80... connected.
Created socket 4.
Releasing 0x807f8a0 (new refcount 1).
---request begin---
GET /TR/xmlschema-0/ HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.w3.org
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 200 OK
Date: Fri, 08 Feb 2002 11:32:56 GMT
Server: Apache/1.3.20 (Unix) PHP/3.0.18
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Cache-Control: max-age=21600
Expires: Fri, 08 Feb 2002 17:32:56 GMT
Last-Modified: Tue, 01 May 2001 12:41:33 GMT
ETag: "5bc532-49c8a-3aeeaefd"
Accept-Ranges: bytes
Content-Length: 302218
Keep-Alive: timeout=15
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

Found www.w3.org in host_name_addresses_map (0x807f8a0)
Registered fd 4 for persistent reuse.
Length: 302,218 [text/html]

[...download and link analysis deleted...]

no-follow in TR/xmlschema-0/index.html: 0
Deciding whether to enqueue "http://www.w3.org/StyleSheets/TR/W3C-REC".
Decided to load it.
Enqueuing http://www.w3.org/StyleSheets/TR/W3C-REC at depth 1
Queue count 1, maxcount 1.
Deciding whether to enqueue "http://www.w3.org/".
Going to "" would escape "TR" with no_parent on.
Decided NOT to load it.
[... lots more decisions ...]
Deciding whether to enqueue "http://www.w3.org/2001/03/XMLSchema/TypeLibrary.xsd".
Going to "2001/03/XMLSchema" would escape "TR" with no_parent on.
Decided NOT to load it.
[... more decisions and then downloading ...]
--- END ---

--
Phil Richards, <[EMAIL PROTECTED]>
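
The no_parent decisions in the log can be modelled with a short Python sketch. This is not wget's code: the function names `parent_dir` and `escapes_parent` are hypothetical, and the sketch only compares directory prefixes, ignoring every other option (such as -p). It assumes that the "directory" of a URL is everything up to the last "/", which matches the `"TR"` and `"2001/03/XMLSchema"` strings printed in the log. The sibling link `/TR/xmlschema-1/` is an illustrative example and does not come from the log.

```python
from urllib.parse import urlsplit


def parent_dir(url: str) -> str:
    """Directory part of a URL path, without leading or trailing slashes."""
    path = urlsplit(url).path
    # Everything before the last "/" counts as the directory, so a trailing
    # slash makes the final path component part of the directory.
    return path.rsplit("/", 1)[0].strip("/")


def escapes_parent(root_url: str, candidate_url: str) -> bool:
    """True if candidate_url lies outside the directory of root_url."""
    if urlsplit(root_url).netloc != urlsplit(candidate_url).netloc:
        return True
    root_dir = parent_dir(root_url)
    cand_dir = parent_dir(candidate_url)
    if root_dir == "":
        return False  # the root is the top of the site; nothing escapes it
    return not (cand_dir == root_dir or cand_dir.startswith(root_dir + "/"))


original = "http://www.w3.org/TR/xmlschema-0"     # URL given on the command line
redirected = "http://www.w3.org/TR/xmlschema-0/"  # Location: from the 301
sibling = "http://www.w3.org/TR/xmlschema-1/"     # illustrative link only

print(parent_dir(original))                 # TR
print(parent_dir(redirected))               # TR/xmlschema-0
print(escapes_parent(original, sibling))    # False: followed, as reported
print(escapes_parent(redirected, sibling))  # True: what I expected from -np
```

With the original URL as the root, anything under /TR/ passes the check; with the redirect target as the root, only /TR/xmlschema-0/ and below would.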