[ http://issues.apache.org/jira/browse/NUTCH-56?page=all ]

Andrzej Bialecki closed NUTCH-56:
----------------------------------

Resolution: Fixed

Applied. I changed the name of the property to follow an already existing "http.robots.*" hierarchy. Thanks!

> Crawling sites with 403 Forbidden robots.txt
> --------------------------------------------
>
>          Key: NUTCH-56
>          URL: http://issues.apache.org/jira/browse/NUTCH-56
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Andy Liu
>     Priority: Minor
>  Attachments: robots_403.patch
>
> If a 403 error is encountered when trying to access the robots.txt file,
> Nutch does not crawl any pages from that site. This behavior is consistent
> with the RFC recommendation for the robot exclusion protocol.
> However, Google does crawl sites that exhibit this type of behavior, because
> most webmasters of these sites are unaware of robots.txt conventions and do
> want their site to be crawled.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
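
For readers following NUTCH-56, the configurable behavior described in the issue can be sketched roughly as below. This is only an illustration, not the code from robots_403.patch: the class `RobotsPolicy`, the method `mayCrawlOn403`, and the property name `http.robots.403.allow` are assumptions (the message says only that the property was renamed to fit the "http.robots.*" hierarchy, not what it is called). The default of `false` is likewise an assumption, chosen to match the pre-patch behavior the issue describes.

```java
import java.util.Properties;

// Illustrative sketch only; not taken from robots_403.patch.
final class RobotsPolicy {

    // Hypothetical property name, assumed to sit under "http.robots.*".
    static final String ALLOW_ON_403 = "http.robots.403.allow";

    private RobotsPolicy() {}

    /**
     * Decides whether a site may be crawled when the request for its
     * robots.txt returned 403 Forbidden.
     *
     * With the property unset or "false", the site is skipped (the
     * behavior the issue attributes to Nutch before the patch).
     * With the property set to "true", the 403 is treated as if no
     * restrictions had been published.
     */
    static boolean mayCrawlOn403(Properties conf) {
        // Assumed default: "false", i.e. do not crawl.
        return Boolean.parseBoolean(conf.getProperty(ALLOW_ON_403, "false"));
    }
}
```

A fetcher would consult such a check only after receiving a 403 for robots.txt. Other status codes, and parsing of a robots.txt that was fetched successfully, are outside this sketch.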