Jay created NUTCH-2800:
--------------------------
Summary: Outdated information in documentation about catch all
user agent
Key: NUTCH-2800
URL: https://issues.apache.org/jira/browse/NUTCH-2800
Project: Nutch
Issue Type: Bug
Components: documentation
Reporter: Jay
The page [http://nutch.apache.org/bot.html] states that all Nutch-based
crawlers obey robots.txt rules addressed to the user agent name "Nutch",
irrespective of the actual user agent name(s) set in the configuration
(nutch-site.xml). The page recommends that a webmaster ban all Nutch-based
crawlers by putting this in the robots.txt file:
User-agent: Nutch
Disallow: /
I tested this by crawling a site ([https://store.stockcharts.com/robots.txt])
whose robots.txt contains these lines, using a Nutch 1.6 variant (Common Crawl
fork) configured with a different user agent name. Nutch allowed me to fetch
the page, so this catch-all user agent isn't working, and the documentation
should be updated to reflect this change in behavior.
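
To make the difference concrete, here is a minimal sketch using Python's
standard urllib.robotparser, not Nutch's own robots.txt handling. It compares a
crawler that checks only its configured agent name with one that also checks
the name "Nutch", as the bot page describes. The agent name "examplebot", the
URL https://example.com/page and the helper functions are assumptions made up
for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the bot page, as a webmaster would write them.
ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""


def parse_rules(text):
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, agent_names, url):
    """Allow a fetch only if every listed agent name is allowed."""
    return all(parser.can_fetch(name, url) for name in agent_names)


rules = parse_rules(ROBOTS_TXT)
url = "https://example.com/page"  # hypothetical URL
configured = "examplebot"         # hypothetical configured agent name

# Only the configured name is checked: the "Nutch" rule never applies.
only_configured = allowed(rules, [configured], url)

# The configured name plus the catch-all name "Nutch" are both checked.
with_catch_all = allowed(rules, [configured, "Nutch"], url)

print(only_configured, with_catch_all)
```

In this sketch the fetch is permitted when only the configured name is
checked and refused when "Nutch" is checked as well, which matches the
behavior I observed versus the behavior the page documents.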
--
This message was sent by Atlassian Jira
(v8.3.4#803005)