[
https://issues.apache.org/jira/browse/NUTCH-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170399#comment-17170399
]
Hudson commented on NUTCH-2801:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch-trunk #3694 (See
[https://builds.apache.org/job/Nutch-trunk/3694/])
[NUTCH-2801] RobotsRulesParser command-line checker to use (snagel:
[https://github.com/apache/nutch/commit/f24ccab54773c57fcf6b3a48eae00d977ab5ca6b])
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java
[NUTCH-2801] RobotsRulesParser command-line checker to use (snagel:
[https://github.com/apache/nutch/commit/6801ac79b2f45061ea4bd3b31ffd64e5195cf9c2])
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java
> RobotsRulesParser command-line checker to use http.robots.agents as fall-back
> -----------------------------------------------------------------------------
>
> Key: NUTCH-2801
> URL: https://issues.apache.org/jira/browse/NUTCH-2801
> Project: Nutch
> Issue Type: Bug
> Components: checker, robots
> Affects Versions: 1.17
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.18
>
>
> The RobotRulesParser command-line tool, used to check a list of URLs against
> one robots.txt file, should use the value of the property
> {{http.robots.agents}} as a fall-back if no user agent names are explicitly
> given as command-line arguments. In this case it should behave the same as the
> robots.txt parser, looking first for {{http.agent.name}}, then for the other
> names listed in {{http.robots.agents}}, and finally picking the rules for
> {{User-agent: *}}.
> {noformat}
> $> cat robots.txt
> User-agent: Nutch
> Allow: /
> User-agent: *
> Disallow: /
> $> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
> -Dhttp.agent.name=mybot \
> -Dhttp.robots.agents='nutch,goodbot' \
> robots.txt urls.txt
> Testing robots.txt for agent names: mybot,nutch,goodbot
> not allowed: https://www.example.com/
> {noformat}
> The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading.
> Only the name "mybot" is actually checked.
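
The fall-back order requested above (first http.agent.name, then the names in http.robots.agents, finally "User-agent: *") can be illustrated with a minimal sketch. This is not the Nutch RobotRulesParser code: the class AgentFallbackSketch and its methods agentNames and selectGroup are hypothetical, and the case-insensitive comparison of agent names is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the fall-back order; not the Nutch implementation. */
final class AgentFallbackSketch {

  /**
   * Builds the ordered, de-duplicated list of agent names to test:
   * the value of http.agent.name first, then the names from http.robots.agents.
   * Names are lower-cased (assumption: matching is case-insensitive).
   */
  static List<String> agentNames(String agentName, String robotsAgents) {
    Set<String> names = new LinkedHashSet<>();
    if (agentName != null && !agentName.trim().isEmpty()) {
      names.add(agentName.trim().toLowerCase(Locale.ROOT));
    }
    if (robotsAgents != null) {
      for (String name : robotsAgents.split(",")) {
        String n = name.trim().toLowerCase(Locale.ROOT);
        if (!n.isEmpty()) {
          names.add(n);
        }
      }
    }
    return new ArrayList<>(names);
  }

  /**
   * Returns the first agent name that has its own rule group in robots.txt,
   * falling back to "*" if present, or null if no group applies.
   * The group names are expected to be lower-cased already.
   */
  static String selectGroup(List<String> names, Set<String> groups) {
    for (String name : names) {
      if (groups.contains(name)) {
        return name;
      }
    }
    return groups.contains("*") ? "*" : null;
  }

  public static void main(String[] args) {
    // Values taken from the example in the issue description.
    List<String> names = agentNames("mybot", "nutch,goodbot");
    // Rule groups of the example robots.txt, lower-cased.
    Set<String> groups = new HashSet<>(Arrays.asList("nutch", "*"));
    // prints: [mybot, nutch, goodbot] -> nutch
    System.out.println(names + " -> " + selectGroup(names, groups));
  }
}
```

Under these assumptions the example robots.txt would be evaluated with the "Nutch" group, whose "Allow: /" rule permits https://www.example.com/, whereas checking only "mybot" falls through to "User-agent: *" and yields "not allowed".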