On 2008-12-12 09:03 +0100, Morten Lemvigh wrote:

> No links on a page with a missing last-modified header are
> scanned, if the page is on the disk already. If I run:
>
> wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
>
> --08:51:24--
> http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
>            => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
> Resolving eur-lex.europa.eu... 147.67.136.2, 147.67.136.102,
> 147.67.119.2, ...
> Connecting to eur-lex.europa.eu|147.67.136.2|:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 9.709 (9.5K) [text/html]
> Last-modified header missing -- time-stamps turned off.
> 08:51:24 (82.42 KB/s) -
> `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML' saved
> [9709/9709]
> [....]
>
> wget will retrieve the page and continue recursively getting all the
> linked pages, as I would expect.

OK. This is normal.

> If I issue this command a second time, all I get is this:
>
> wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
> --08:53:18--
> http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
>            => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
> Resolving eur-lex.europa.eu... 147.67.119.2, 147.67.119.102,
> 147.67.136.2, ...
> Connecting to eur-lex.europa.eu|147.67.119.2|:80... connected.
> HTTP request sent, awaiting response... 500 Internal Server Error
> 08:53:18 ERROR 500: Internal Server Error.
> FINISHED --08:53:18--
> Downloaded: 0 bytes in 0 files
>
> So all the pages linked from this page are ignored too. It's fine
> if wget skips the problematic document, but I would prefer wget
> to continue the recursive scan.

The first time, the local file doesn't exist so Wget issues a GET
request, which succeeds (200).

The second time, the local file exists so Wget must first check
whether the resource has changed. To that end, it issues a HEAD
request. The server apparently doesn't know when the document was
last modified. It could fulfill the HEAD request without a
Last-modified header. Instead, it rejects it with a 500.

It's not that the missing Last-modified header causes Wget to
"ignore the links". It's that there is no document to scan for
links because, when queried about it, the server replied 500.

To work around that kind of brokenness, Wget would have to ignore
the 500 error and fall back on parsing the local file. That should
probably not be made the default behaviour, though.

-- 
André Majorel <URL:http://www.teaser.fr/~amajorel/>
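
To make the fallback described above concrete, here is a rough
sketch in Python. It is not Wget's code and does not reflect any
existing Wget option: plan_fetch, tolerate_server_errors and
extract_links are hypothetical names made up for illustration.
It only models the decision "the timestamp check failed with a
5xx, so either give up on the document or scan the copy already
on disk for links".

```python
from html.parser import HTMLParser
from typing import List, Optional


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return the links found in an HTML document already on disk."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def plan_fetch(local_exists: bool,
               head_status: Optional[int],
               tolerate_server_errors: bool = False) -> str:
    """Decide what to do with one URL during a timestamped recursive run.

    head_status is the status of the timestamp check (the HEAD request),
    or None when no check was made because there is no local file.
    tolerate_server_errors is an assumed opt-in switch, off by default.
    """
    if not local_exists:
        return "download"            # first run: plain GET
    if head_status is not None and 200 <= head_status < 300:
        return "revalidate"          # normal timestamp handling
    if (head_status is not None and 500 <= head_status < 600
            and tolerate_server_errors):
        return "parse_local"         # fallback: scan the copy on disk
    return "skip"                    # what happened in the report


if __name__ == "__main__":
    print(plan_fetch(False, None))
    print(plan_fetch(True, 500))
    print(plan_fetch(True, 500, tolerate_server_errors=True))
```

With the switch left off, the second run from the report maps to
"skip", which matches what was observed. Turning it on would let
the recursion continue from the stale local copy, at the cost of
hiding a genuine server error.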
