[
https://issues.apache.org/jira/browse/NUTCH-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jay updated NUTCH-2800:
-----------------------
Priority: Major (was: Minor)
> Outdated information in documentation about catch all user agent
> ----------------------------------------------------------------
>
> Key: NUTCH-2800
> URL: https://issues.apache.org/jira/browse/NUTCH-2800
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Reporter: Jay
> Priority: Major
>
> The bot page [http://nutch.apache.org/bot.html] states that all Nutch-based
> crawlers obey robots.txt rules addressed to the user agent name "Nutch",
> irrespective of the user agent name(s) actually configured in
> nutch-site.xml.
> The page tells webmasters that they can ban all Nutch-based crawlers by
> putting the following in their robots.txt file:
> User-agent: Nutch
> Disallow: /
> I tested this by crawling a site with a Nutch 1.6 variant (Common Crawl
> fork) configured with a different user agent name. The site's robots.txt
> ([https://store.stockcharts.com/robots.txt]) contains the rule above, yet
> Nutch allowed me to fetch the page. The catch-all user agent therefore does
> not work, and the documentation should be updated to reflect this change in
> behavior.
>
>
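To make the reported difference concrete, here is a minimal sketch using
Python's standard urllib.robotparser rather than Nutch's own robots.txt
handling. It contrasts a crawler that checks rules only under its configured
agent name (the behavior observed in the report) with one that also honors the
"Nutch" catch-all name (the behavior the bot page describes). The agent name
"MyCrawler", the example.com URL, and both helper functions are hypothetical
and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule quoted in the report.
ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""

# Name the bot page says every Nutch-based crawler should honor.
CATCH_ALL_AGENT = "Nutch"


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed_own_name_only(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Hypothetical check that consults only the configured agent name."""
    return parser.can_fetch(agent, url)


def allowed_with_catch_all(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Hypothetical check that also honors rules addressed to 'Nutch'."""
    return parser.can_fetch(agent, url) and parser.can_fetch(CATCH_ALL_AGENT, url)


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    url = "https://example.com/page"  # placeholder URL, not from the report
    print(allowed_own_name_only(rules, "MyCrawler", url))
    print(allowed_with_catch_all(rules, "MyCrawler", url))
```

Under these assumptions, the first check lets "MyCrawler" through because no
rule names it, while the second refuses the fetch because the "Nutch" rule
disallows everything. The report suggests the tested Nutch 1.6 fork behaves
like the first check.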
--
This message was sent by Atlassian Jira
(v8.3.4#803005)