Jay created NUTCH-2800:
--------------------------

             Summary: Outdated information in documentation about catch-all user agent
                 Key: NUTCH-2800
                 URL: https://issues.apache.org/jira/browse/NUTCH-2800
             Project: Nutch
          Issue Type: Bug
          Components: documentation
            Reporter: Jay


This page [http://nutch.apache.org/bot.html] states that all Nutch-based 
crawlers respond to the user agent name "Nutch", irrespective of the actual 
user agent name(s) set in the configuration (nutch-site.xml). 

The page says that a webmaster can ban all Nutch-based crawlers by putting the 
following in the robots.txt file:

User-agent: Nutch
Disallow: /

I tested this with a Nutch 1.6 variant (the Common Crawl fork) configured with 
a different user agent name, crawling a site whose robots.txt contains these 
lines ( [https://store.stockcharts.com/robots.txt] ). Nutch fetched the page, 
so the catch-all user agent isn't working, and the documentation should be 
updated to reflect this change in behavior. 
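
For comparison, the plain robots.txt semantics can be checked outside Nutch. 
The sketch below uses Python's standard urllib.robotparser, not Nutch's own 
robots.txt handling, so it only illustrates how a generic parser treats the 
two rules quoted above. The agent name "MyCrawler" and the example.com URL are 
placeholders, not values from the crawl described above.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted from the bot.html page.
ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""


def is_allowed(agent_name: str, url: str = "https://example.com/page") -> bool:
    """Return True if agent_name may fetch url under ROBOTS_TXT.

    The default url is a placeholder; any path is covered by "Disallow: /".
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent_name, url)


if __name__ == "__main__":
    # "MyCrawler" is a hypothetical custom agent name.
    for agent in ("Nutch", "MyCrawler"):
        print(agent, "allowed" if is_allowed(agent) else "blocked")
```

With generic matching, only an agent whose name matches "Nutch" is blocked by 
these rules. A crawler announcing a different name is allowed unless it also 
checks "Nutch" as a fallback agent name itself, which is the behavior the 
documentation describes.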

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
