Hi Michael,
Is it similar to https://issues.apache.org/jira/browse/NUTCH-98 ? (I
just typed "robot" in the search field of Nutch's JIRA at
https://issues.apache.org/jira/browse/NUTCH)
HTH,
Renaud
Michael Böckling wrote:
Hi!
I experimented with the robots.txt parser component of Nutch, and it seems that it does
not work as it should. The call to RobotRulesParser.getRobotRulesSet() returns only the
entry with the highest precedence, which depends on the order of the values in the
"http.robots.agents" configuration directive.
Here's an example:

robots.txt:
    User-agent: *
    Disallow: /some/rule/

    User-agent: nutch
    Disallow: /some/other/rule/

configuration:
    http.robots.agents=nutch,*

==> the ruleset for "User-agent: *" is ignored
Expected behaviour: the "*" rules should be applied in every case.
Reason: the parser returns only "bestRulesSoFar" (the actual name of the
variable).
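
To make the behaviour concrete, here is a minimal, self-contained Java sketch.
It is not Nutch's actual code: the class RobotsPrecedenceSketch and its methods
parse, bestMatchOnly, bestMatchPlusWildcard and isAllowed are hypothetical names
made up for illustration, and the parsing is deliberately simplified. It only
mimics the "keep the best match" selection described above and contrasts it
with the merged behaviour expected here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT Nutch's RobotRulesParser.
class RobotsPrecedenceSketch {

    // Simplified parse: maps each lower-cased User-agent value to its Disallow prefixes.
    // Assumption: one User-agent line per group (no shared groups, no Allow lines).
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        List<String> current = null;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');            // drop trailing comments
            if (hash >= 0) line = line.substring(0, hash).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;                 // blank or malformed line
            String key = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                current = groups.computeIfAbsent(value.toLowerCase(), k -> new ArrayList<>());
            } else if (key.equals("disallow") && current != null && !value.isEmpty()) {
                current.add(value);
            }
        }
        return groups;
    }

    // Mimics the reported behaviour: the first configured agent that has a
    // group wins, and every other group (including "*") is dropped.
    static List<String> bestMatchOnly(Map<String, List<String>> groups, String[] agents) {
        for (String agent : agents) {
            List<String> rules = groups.get(agent.trim().toLowerCase());
            if (rules != null) return rules;
        }
        return new ArrayList<>();
    }

    // The behaviour expected in this message: best match plus the "*" rules.
    static List<String> bestMatchPlusWildcard(Map<String, List<String>> groups, String[] agents) {
        List<String> merged = new ArrayList<>(bestMatchOnly(groups, agents));
        for (String rule : groups.getOrDefault("*", Collections.<String>emptyList())) {
            if (!merged.contains(rule)) merged.add(rule);
        }
        return merged;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /some/rule/\n\n"
                         + "User-agent: nutch\nDisallow: /some/other/rule/\n";
        Map<String, List<String>> groups = parse(robotsTxt);
        String[] agents = "nutch,*".split(",");      // as in http.robots.agents=nutch,*

        System.out.println("best match only: " + bestMatchOnly(groups, agents));
        System.out.println("with * merged:   " + bestMatchPlusWildcard(groups, agents));
        System.out.println("/some/rule/x allowed (best match only)? "
                + isAllowed(bestMatchOnly(groups, agents), "/some/rule/x"));
    }
}
```

With the agent order "nutch,*", the sketch's bestMatchOnly keeps only
/some/other/rule/, so a path under /some/rule/ counts as allowed; reversing the
order to "*,nutch" would instead keep only /some/rule/.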
Is this bug known, and if so, is there a workaround or fix?
Thanks for any help!
Regards,
Michael