[jira] [Updated] (NUTCH-2800) Outdated information in documentation about catch all user agent

Jay (Jira) Wed, 08 Jul 2020 20:03:17 -0700


     [ 
https://issues.apache.org/jira/browse/NUTCH-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jay updated NUTCH-2800:
-----------------------
    Priority: Major  (was: Minor)

> Outdated information in documentation about catch all user agent
> ----------------------------------------------------------------
>
>                 Key: NUTCH-2800
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2800
>             Project: Nutch
>          Issue Type: Bug
>          Components: documentation
>            Reporter: Jay
>            Priority: Major
>
> The page [http://nutch.apache.org/bot.html] states that all Nutch-based 
> crawlers respond to the user agent name "Nutch", irrespective of the 
> actual user agent name(s) set in the configuration (nutch-site.xml). 
> The page recommends that a webmaster ban all Nutch-based crawlers by 
> simply putting this in the robots.txt file:
> User-agent: Nutch
> Disallow: /
> I tested this by crawling a site with a Nutch 1.6 variant (the Common 
> Crawl fork) configured with a different user agent name. The site's 
> robots.txt ( [https://store.stockcharts.com/robots.txt] ) contains the 
> rules above, yet Nutch still allowed me to fetch the page. The catch-all 
> user agent therefore isn't working, and the documentation should be 
> updated to reflect this change in behavior. 
>  
>  
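
The behavior reported above matches how ordinary robots.txt matching works when no special catch-all name is applied. A minimal sketch, assuming Python's standard-library parser as a stand-in (this is not Nutch's own robots.txt handling, and the agent name "MyCrawler" and the example.com URL are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the issue description.
ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""


def is_allowed(agent, url="https://example.com/page.html"):
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A crawler announcing itself as "Nutch" is blocked.
    print("Nutch     ->", is_allowed("Nutch"))
    # A crawler configured with another name matches no group,
    # so nothing disallows the fetch.
    print("MyCrawler ->", is_allowed("MyCrawler"))
```

Under this kind of matching, the "User-agent: Nutch" group only affects crawlers that actually present that name, so a deployment configured with a different agent name is not covered unless the crawler itself adds "Nutch" to the names it checks.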



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
