<QUOTE url="http://www.robotstxt.org/wc/norobots.html">
The format and semantics of the "/robots.txt" file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive. ... The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.
User-agent
The value of this field is the name of the robot the record is describing access policy for. If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record. The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
</QUOTE>

================

I analyzed your robots.txt file. Nutch should understand it. This is the record for Nutch from your robots.txt:

User-agent: *
Disallow: /blog/?m
Disallow: /blog?m
Disallow: /blog/?c
Disallow: /blog?c
Disallow: /blog/index.php?
Disallow: /blog?p
Disallow: /blog/index.php/
Disallow: /blog/narchives2.php/
Disallow: /blog/narchives2.php?
Disallow: /blog/wp-
Disallow: /blog/feed/
Disallow: /blog/?feed
Disallow: /blog/pic
Disallow: /blog/?1
Disallow: /blog/?2
Disallow: /blog/?3
Disallow: /blog/?4
Disallow: /blog/?5
Disallow: /blog/?6
Disallow: /blog/?7
Disallow: /blog/?8
Disallow: /blog/?9
Disallow: /blog/?0
Disallow: /blog/get
Disallow: /blog/xml
Disallow: /pictures/pics
Disallow: /pictures/thumbs
Disallow: /pictures/newpics
Disallow: /pictures/ecle
Disallow: /contact
Disallow: /gb
Disallow: /pictures/p01/thum
Disallow: /pictures/p02/thum
Disallow: /pictures/p03/thum
Disallow: /pictures/p04/thum
Disallow: /pictures/p05/thum
Disallow: /pictures/p06/thum
Disallow: /pictures/p07/thum
Disallow: /pictures/p08/thum
Disallow: /pictures/p09/thum
Disallow: /pictures/p10/thum
Disallow: /pictures/p11/thum
Disallow: /pictures/p12/thum
Disallow: /pictures/p13/thum
Disallow: /pictures/p14/thum
Disallow: /xhtmlpics
Disallow: /archives/afh
Disallow: /archives/herb
Disallow: /archives/culi
Disallow: /thum

>That gbx.php is my guestbook, which I've blocked in robots.txt.

===
User-agent: *
Disallow: /gb
===

>They hit a bot trap later on and got blocked, but nutch only picked up 3 files
>after it got the first 403.

Since "/gbx.php" starts with the disallowed prefix "/gb", Nutch should not attempt to "GET /gbx.php HTTP/1.0". (See the prefix-match sketch at the end of this message.)

Thanks

-----Original Message-----
From: Henriette Kress
Sent: Friday, January 20, 2006 2:07 AM
To: nutch-agent@lucene.apache.org
Subject: cairo.ee.ucla.edu: nutch didn't obey robots.txt
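
P.S. For anyone following along, here is a minimal sketch in plain Java of the prefix rule the spec excerpt describes. This is not Nutch's actual RobotRulesParser; the class and method names are made up for illustration, and User-agent matching is left out (the sketch assumes the record already applies to the robot). It collects the Disallow values from one record and refuses any path that starts with one of them.

import java.util.ArrayList;
import java.util.List;

public class RobotsPrefixCheck {

    // Disallow prefixes collected from the record that applies to this robot.
    private final List<String> disallowed = new ArrayList<>();

    // Parse one record of "<field>:<optionalspace><value>" lines;
    // field names are case insensitive, unrecognised lines are ignored.
    void parseRecord(String recordText) {
        for (String line : recordText.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // malformed line, ignore
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("disallow") && !value.isEmpty()) {
                disallowed.add(value); // an empty Disallow value allows everything
            }
        }
    }

    // "Any URL that starts with this value will not be retrieved."
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsPrefixCheck rules = new RobotsPrefixCheck();
        rules.parseRecord("User-agent: *\nDisallow: /gb\n");

        System.out.println(rules.isAllowed("/gbx.php")); // false: "/gbx.php" starts with "/gb"
        System.out.println(rules.isAllowed("/blog/"));   // true: no Disallow prefix matches
    }
}

Run as-is, it prints "false" for /gbx.php and "true" for /blog/, which matches the point above: Disallow: /gb already covers /gbx.php by prefix, so a compliant robot should never fetch it.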