<QUOTE url="http://www.robotstxt.org/wc/norobots.html">
The format and semantics of the "/robots.txt" file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive. ... The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.
User-agent
The value of this field is the name of the robot the record is describing access policy for. If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record. The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
</QUOTE>

================

I analyzed your robots.txt file. Nutch should understand it. This is the record for Nutch from your robots.txt:

User-agent: *
Disallow: /blog/?m
Disallow: /blog?m
Disallow: /blog/?c
Disallow: /blog?c
Disallow: /blog/index.php?
Disallow: /blog?p
Disallow: /blog/index.php/
Disallow: /blog/narchives2.php/
Disallow: /blog/narchives2.php?
Disallow: /blog/wp-
Disallow: /blog/feed/
Disallow: /blog/?feed
Disallow: /blog/pic
Disallow: /blog/?1
Disallow: /blog/?2
Disallow: /blog/?3
Disallow: /blog/?4
Disallow: /blog/?5
Disallow: /blog/?6
Disallow: /blog/?7
Disallow: /blog/?8
Disallow: /blog/?9
Disallow: /blog/?0
Disallow: /blog/get
Disallow: /blog/xml
Disallow: /pictures/pics
Disallow: /pictures/thumbs
Disallow: /pictures/newpics
Disallow: /pictures/ecle
Disallow: /contact
Disallow: /gb
Disallow: /pictures/p01/thum
Disallow: /pictures/p02/thum
Disallow: /pictures/p03/thum
Disallow: /pictures/p04/thum
Disallow: /pictures/p05/thum
Disallow: /pictures/p06/thum
Disallow: /pictures/p07/thum
Disallow: /pictures/p08/thum
Disallow: /pictures/p09/thum
Disallow: /pictures/p10/thum
Disallow: /pictures/p11/thum
Disallow: /pictures/p12/thum
Disallow: /pictures/p13/thum
Disallow: /pictures/p14/thum
Disallow: /xhtmlpics
Disallow: /archives/afh
Disallow: /archives/herb
Disallow: /archives/culi
Disallow: /thum

>That gbx.php is my guestbook, which I've blocked in robots.txt.

===
User-agent: *
Disallow: /gb
===

>They hit a bot trap later on and got blocked, but nutch only picked up 3 files
>after it got the first 403.

Since "/gbx.php" starts with the disallowed prefix "/gb", Nutch should not attempt to "GET /gbx.php HTTP/1.0". (See the prefix-match sketch at the end of this message.)

Thanks

-----Original Message-----
From: Henriette Kress
Sent: Friday, January 20, 2006 2:07 AM
To: nutch-agent@lucene.apache.org
Subject: cairo.ee.ucla.edu: nutch didn't obey robots.txt
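
P.S. For anyone following along, here is a minimal sketch in plain Java of the prefix rule the spec excerpt describes. This is not Nutch's actual RobotRulesParser; the class and method names are made up for illustration, and User-agent matching is left out (the sketch assumes the record already applies to the robot). It collects the Disallow values from one record and refuses any path that starts with one of them.

import java.util.ArrayList;
import java.util.List;

public class RobotsPrefixCheck {

    // Disallow prefixes collected from the record that applies to this robot.
    private final List<String> disallowed = new ArrayList<>();

    // Parse one record of "<field>:<optionalspace><value>" lines;
    // field names are case insensitive, unrecognised lines are ignored.
    void parseRecord(String recordText) {
        for (String line : recordText.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // malformed line, ignore
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("disallow") && !value.isEmpty()) {
                disallowed.add(value); // an empty Disallow value allows everything
            }
        }
    }

    // "Any URL that starts with this value will not be retrieved."
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsPrefixCheck rules = new RobotsPrefixCheck();
        rules.parseRecord("User-agent: *\nDisallow: /gb\n");

        System.out.println(rules.isAllowed("/gbx.php")); // false: "/gbx.php" starts with "/gb"
        System.out.println(rules.isAllowed("/blog/"));   // true: no Disallow prefix matches
    }
}

Run as-is, it prints "false" for /gbx.php and "true" for /blog/, which matches the point above: Disallow: /gb already covers /gbx.php by prefix, so a compliant robot should never fetch it.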