[ http://issues.apache.org/jira/browse/NUTCH-56?page=all ]

Andy Liu updated NUTCH-56:
--------------------------

    Attachment: robots_403.patch

Adds a configuration parameter to allow the crawling of sites whose robots.txt 
returns a 403 Forbidden response.
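
A rough sketch of the kind of decision such a switch presumably introduces: treat a 
403 on robots.txt like a missing robots.txt when the flag is on, and skip the site 
otherwise. The property name http.robots.403.allow and the method below are 
assumptions for illustration, not the contents of robots_403.patch.

// Hypothetical sketch (not the attached patch): how robots.txt handling
// might consult a configuration flag when the server answers 403.
public class RobotsResponseHandling {

  /**
   * Decide whether a site may be crawled given the HTTP status code
   * returned for its robots.txt.
   *
   * @param statusCode HTTP status from fetching /robots.txt
   * @param allow403   value of the (assumed) http.robots.403.allow property
   * @return true if the crawler should treat the site as crawlable
   */
  static boolean allowCrawl(int statusCode, boolean allow403) {
    if (statusCode == 200) {
      // robots.txt fetched successfully; its rules are parsed elsewhere.
      return true;
    }
    if (statusCode == 404) {
      // No robots.txt: by convention the site is crawlable.
      return true;
    }
    if (statusCode == 403) {
      // Forbidden robots.txt: the robot exclusion recommendation is to skip
      // the site, but the flag lets operators opt into the lenient behavior.
      return allow403;
    }
    // Other errors (e.g. 5xx): stay conservative and skip the site.
    return false;
  }

  public static void main(String[] args) {
    System.out.println(allowCrawl(403, true));   // true  -> crawl anyway
    System.out.println(allowCrawl(403, false));  // false -> skip the site
  }
}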

> Crawling sites with 403 Forbidden robots.txt
> --------------------------------------------
>
>          Key: NUTCH-56
>          URL: http://issues.apache.org/jira/browse/NUTCH-56
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Andy Liu
>     Priority: Minor
>  Attachments: robots_403.patch
>
> If a 403 error is encountered when trying to access the robots.txt file, 
> Nutch does not crawl any pages from that site.  This behavior is consistent 
> with the RFC recommendation for the robot exclusion protocol.  
> However, Google does crawl sites that return 403 for robots.txt, because 
> most webmasters of these sites are unaware of robots.txt conventions and do 
> want their sites to be crawled.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
