Sebastian Nagel created NUTCH-2801:
--------------------------------------
Summary: RobotsRulesParser command-line checker to use
http.robots.agents as fall-back
Key: NUTCH-2801
URL: https://issues.apache.org/jira/browse/NUTCH-2801
Project: Nutch
Issue Type: Bug
Components: checker, robots
Affects Versions: 1.17
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.18
The RobotRulesParser command-line tool, used to check a list of URLs against
one robots.txt file, should use the value of the property
{{http.robots.agents}} as a fall-back if no user agent names are explicitly given
as a command-line argument. In this case it should behave the same as the robots.txt
parser: look first for {{http.agent.name}}, then for the other names listed in
{{http.robots.agents}}, and finally pick the rules for {{User-agent: *}}.
{noformat}
$> cat robots.txt
User-agent: Nutch
Allow: /
User-agent: *
Disallow: /
$> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
-Dhttp.agent.name=mybot \
-Dhttp.robots.agents='nutch,goodbot' \
robots.txt urls.txt
Testing robots.txt for agent names: mybot,nutch,goodbot
not allowed: https://www.example.com/
{noformat}
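The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading: only
the name "mybot" is actually checked. Because robots.txt has no rules for
"mybot", the checker falls through to the {{User-agent: *}} rules and reports
the URL as not allowed, although the rules for "Nutch" would allow it.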
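The expected fall-back order can be sketched roughly as follows. This is only an illustration, not the Nutch implementation: the class {{AgentFallbackSketch}} and its methods are hypothetical, and agent names are matched by a simplified exact, case-insensitive comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the expected agent-name fall-back, not Nutch code.
final class AgentFallbackSketch {

  /**
   * Builds the ordered list of agent names to test. Names given on the
   * command line win; otherwise http.agent.name comes first, followed by
   * the names from http.robots.agents.
   */
  static List<String> resolveAgentNames(String cliAgents,
      String httpAgentName, String httpRobotsAgents) {
    Set<String> names = new LinkedHashSet<>(); // keeps order, drops duplicates
    if (cliAgents != null && !cliAgents.trim().isEmpty()) {
      addNames(names, cliAgents);
    } else {
      addNames(names, httpAgentName);
      addNames(names, httpRobotsAgents);
    }
    return new ArrayList<>(names);
  }

  /** Splits a comma-separated list, trims and lower-cases each name. */
  private static void addNames(Set<String> names, String commaSeparated) {
    if (commaSeparated == null) {
      return;
    }
    for (String name : commaSeparated.split(",")) {
      String cleaned = name.trim().toLowerCase(Locale.ROOT);
      if (!cleaned.isEmpty()) {
        names.add(cleaned);
      }
    }
  }

  /**
   * Returns the first agent name that has its own rule group in robots.txt,
   * or "*" if none of the names has one. Group names are assumed to be
   * lower-cased already (simplification).
   */
  static String selectGroup(List<String> agentNames, Set<String> groups) {
    for (String name : agentNames) {
      if (groups.contains(name)) {
        return name;
      }
    }
    return "*";
  }

  public static void main(String[] args) {
    // Values taken from the example in this issue.
    List<String> names = resolveAgentNames(null, "mybot", "nutch,goodbot");
    Set<String> groups = new HashSet<>(Arrays.asList("nutch", "*"));
    System.out.println(names);                      // [mybot, nutch, goodbot]
    System.out.println(selectGroup(names, groups)); // nutch
  }
}
```

With the values from the example, the sketch selects the "nutch" group, so https://www.example.com/ would be reported as allowed rather than falling through to {{User-agent: *}}.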