[ http://issues.apache.org/jira/browse/NUTCH-56?page=all ]
Andy Liu updated NUTCH-56:
--------------------------
Attachment: robots_403.patch
Adds a configuration parameter that allows crawling of sites where a 403 is
returned when accessing robots.txt
> Crawling sites with 403 Forbidden robots.txt
> --------------------------------------------
>
> Key: NUTCH-56
> URL: http://issues.apache.org/jira/browse/NUTCH-56
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Reporter: Andy Liu
> Priority: Minor
> Attachments: robots_403.patch
>
> If a 403 error is encountered when trying to access the robots.txt file,
> Nutch does not crawl any pages from that site. This behavior is consistent
> with the RFC recommendation for the robot exclusion protocol.
> However, Google does crawl sites that behave this way, because the
> webmasters of most such sites are unaware of robots.txt conventions and do
> want their sites to be crawled.
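The gist of the change can be sketched as follows. This is a minimal illustration, not the actual patch: the class name, the config key `http.robots.403.allow`, and the `isCrawlAllowed` method are all hypothetical stand-ins for whatever robots_403.patch introduces in Nutch's robots-handling code.

```java
// Hypothetical sketch of config-gated 403 handling for robots.txt.
// Names here are illustrative, not the real Nutch API.
class RobotsPolicy {

  // Assumed config key, modeled on Nutch's http.robots.* property style.
  static final String ALLOW_FORBIDDEN_KEY = "http.robots.403.allow";

  private final boolean allowForbidden;

  RobotsPolicy(boolean allowForbidden) {
    this.allowForbidden = allowForbidden;
  }

  /**
   * Decide whether a site may be crawled, given the HTTP status
   * returned when fetching its robots.txt.
   */
  boolean isCrawlAllowed(int robotsTxtStatus) {
    if (robotsTxtStatus == 200) return true;           // robots.txt found: parse rules (elided)
    if (robotsTxtStatus == 404) return true;           // no robots.txt: crawl freely
    if (robotsTxtStatus == 403) return allowForbidden; // configurable, per this patch
    return false;                                      // other errors: stay conservative
  }
}
```

With the flag off, a 403 on robots.txt still blocks the whole site (the pre-patch behavior); with it on, the site is treated as if it had no robots.txt.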
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira