Sebastian Nagel created NUTCH-2801:
--------------------------------------

             Summary: RobotsRulesParser command-line checker to use 
http.robots.agents as fall-back
                 Key: NUTCH-2801
                 URL: https://issues.apache.org/jira/browse/NUTCH-2801
             Project: Nutch
          Issue Type: Bug
          Components: checker, robots
    Affects Versions: 1.17
            Reporter: Sebastian Nagel
            Assignee: Sebastian Nagel
             Fix For: 1.18


The RobotRulesParser command-line tool, used to check a list of URLs against 
one robots.txt file, should use the value of the property 
{{http.robots.agents}} as a fall-back if no user agent names are explicitly given 
as command-line arguments. In this case it should behave the same as the robots.txt 
parser: look first for {{http.agent.name}}, then for the other names listed in 
{{http.robots.agents}}, and finally pick the rules for {{User-agent: *}}.

{noformat}
$> cat robots.txt
User-agent: Nutch
Allow: /
User-agent: *
Disallow: /

$> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
      -Dhttp.agent.name=mybot \
      -Dhttp.robots.agents='nutch,goodbot' \
      robots.txt urls.txt 
Testing robots.txt for agent names: mybot,nutch,goodbot
not allowed:    https://www.example.com/
{noformat}

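The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading: only 
the name "mybot" is actually checked. The rules for {{User-agent: Nutch}} are 
therefore never applied, and the URL is reported as not allowed because of the 
{{User-agent: *}} rules.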
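
The intended fall-back order can be sketched as follows. This is only an illustration, not the Nutch or crawler-commons implementation: the class AgentNameFallback, its methods, and the map-based representation of robots.txt groups are hypothetical, and agent names are matched by exact lower-cased comparison, which is a simplification of what a real robots.txt parser does.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Nutch) illustrating the fall-back order. */
final class AgentNameFallback {

  /**
   * Returns the agent names to test, in priority order. Names given on the
   * command line win; otherwise http.agent.name comes first, followed by the
   * names listed in http.robots.agents.
   */
  static List<String> resolveAgentNames(String cliAgents, String httpAgentName,
      String httpRobotsAgents) {
    Set<String> names = new LinkedHashSet<>(); // keeps order, drops duplicates
    if (cliAgents != null && !cliAgents.trim().isEmpty()) {
      addNames(names, cliAgents);
    } else {
      addNames(names, httpAgentName);
      addNames(names, httpRobotsAgents);
    }
    return new ArrayList<>(names);
  }

  /** Splits a comma-separated list and adds the trimmed, lower-cased names. */
  private static void addNames(Set<String> names, String commaSeparated) {
    if (commaSeparated == null) {
      return;
    }
    for (String name : commaSeparated.split(",")) {
      String trimmed = name.trim().toLowerCase(Locale.ROOT);
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
  }

  /**
   * Picks the rules of the first agent name that has its own group in
   * robots.txt, falling back to the "*" group. Keys of rulesByAgent are
   * assumed to be lower-cased User-agent values. An empty string stands
   * for "no rules apply".
   */
  static String selectRules(List<String> agentNames,
      Map<String, String> rulesByAgent) {
    for (String name : agentNames) {
      String rules = rulesByAgent.get(name);
      if (rules != null) {
        return rules;
      }
    }
    return rulesByAgent.getOrDefault("*", "");
  }
}
```

With the robots.txt above, modelled as the groups "nutch" (Allow: /) and "*" (Disallow: /), the name list mybot,nutch,goodbot selects the rules of the "nutch" group, whereas checking "mybot" alone falls through to the "*" group. The latter corresponds to the behaviour reported in this issue.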



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
