Hi, I'm using wget 1.13.4. There seems to be a problem with wget over-zealously obeying robot exclusion when --page-requisites is used, even when only downloading a single URL.
I attempted to download a single web page, specifying --page-requisites so that the images, CSS and JavaScript files required by the page are also downloaded:

    wget -x -S --page-requisites http://www.example.com/path/file.html

The HTML page that was downloaded contained this line:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

The presence of that line causes wget not to download the page requisites, and there is nothing in the log output to indicate that --page-requisites is being ignored.

I think wget should not pay attention to robot exclusion when downloading page requisites. Typically, you won't know in advance whether a page you're about to download has a robots line in its HTML source, so you have to specify "-e robots=off" whenever you use --page-requisites to ensure all the requisites are downloaded. But in cases where you *are* downloading recursively and using --page-requisites, it would be polite to otherwise obey the robots exclusion standard by default, which you can't do if you have to use -e robots=off to ensure all the requisites are downloaded.

Mark
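
P.S. For clarity, the workaround amounts to adding -e robots=off to the same command (the URL is just the example from above):

    # disable robots processing so the page requisites are fetched despite the META tag
    wget -e robots=off -x -S --page-requisites http://www.example.com/path/file.html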
