[ http://issues.apache.org/jira/browse/NUTCH-56?page=all ]

Andy Liu updated NUTCH-56:
--------------------------

    Attachment: robots_403.patch

Adds a configuration parameter to allow the crawling of sites whose robots.txt 
returns a 403 Forbidden response.
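
A rough sketch of the kind of decision such a switch presumably introduces: treat a 
403 on robots.txt like a missing robots.txt when the flag is on, and skip the site 
otherwise. The property name http.robots.403.allow and the method below are 
assumptions for illustration, not the contents of robots_403.patch.

// Hypothetical sketch (not the attached patch): how robots.txt handling
// might consult a configuration flag when the server answers 403.
public class RobotsResponseHandling {

  /**
   * Decide whether a site may be crawled given the HTTP status code
   * returned for its robots.txt.
   *
   * @param statusCode HTTP status from fetching /robots.txt
   * @param allow403   value of the (assumed) http.robots.403.allow property
   * @return true if the crawler should treat the site as crawlable
   */
  static boolean allowCrawl(int statusCode, boolean allow403) {
    if (statusCode == 200) {
      // robots.txt fetched successfully; its rules are parsed elsewhere.
      return true;
    }
    if (statusCode == 404) {
      // No robots.txt: by convention the site is crawlable.
      return true;
    }
    if (statusCode == 403) {
      // Forbidden robots.txt: the robot exclusion recommendation is to skip
      // the site, but the flag lets operators opt into the lenient behavior.
      return allow403;
    }
    // Other errors (e.g. 5xx): stay conservative and skip the site.
    return false;
  }

  public static void main(String[] args) {
    System.out.println(allowCrawl(403, true));   // true  -> crawl anyway
    System.out.println(allowCrawl(403, false));  // false -> skip the site
  }
}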

> Crawling sites with 403 Forbidden robots.txt
> --------------------------------------------
>
>          Key: NUTCH-56
>          URL: http://issues.apache.org/jira/browse/NUTCH-56
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Andy Liu
>     Priority: Minor
>  Attachments: robots_403.patch
>
> If a 403 error is encountered when trying to access the robots.txt file, 
> Nutch does not crawl any pages from that site.  This behavior is consistent 
> with the RFC recommendation for the robot exclusion protocol.  
> However, Google does crawl sites that return 403 for robots.txt, because 
> most webmasters of these sites are unaware of robots.txt conventions and do 
> want their sites to be crawled.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
