Hi Michael,

Is it similar to https://issues.apache.org/jira/browse/NUTCH-98? (I just typed "robot" into the search field of Nutch's JIRA at https://issues.apache.org/jira/browse/NUTCH.)

HTH,
Renaud


Michael Böckling wrote:
Hi!

I experimented with the robots.txt parser component of Nutch, and it seems that it does 
not work as it should. The call to RobotRulesParser.getRobotRulesSet() returns only the 
entry with the highest precedence, which depends on the order of the values in the 
"http.robots.agents" configuration directive.

Here's an example:


robots.txt:
User-agent: *
Disallow: /some/rule/
User-agent: nutch
Disallow: /some/other/rule/

configuration:
http.robots.agents=nutch,*

==> the ruleset for "User-agent: *" is ignored


Expected behaviour: the "*" rules should be applied in every case.

Reason: the parser returns only "bestRulesSoFar" (the actual name of the 
variable), so the rules of every other matching entry are discarded.
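
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours using the example above. It is not Nutch's actual code: the class RobotsPrecedenceSketch and the methods parse, bestMatchOnly and bestMatchPlusWildcard are hypothetical names made up for illustration, and the parsing is deliberately simplified (only User-agent and Disallow lines, one agent per group).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Nutch RobotRulesParser.
public class RobotsPrecedenceSketch {

    // Simplified parser: maps each (lower-cased) user agent to its Disallow paths.
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String current = null;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // skip blank or malformed lines
            }
            String key = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                current = value.toLowerCase();
                groups.computeIfAbsent(current, k -> new ArrayList<>());
            } else if (key.equals("disallow") && current != null && !value.isEmpty()) {
                groups.get(current).add(value);
            }
        }
        return groups;
    }

    // Behaviour as described in this mail: only the rules of the first
    // configured agent that has an entry in robots.txt are returned.
    static List<String> bestMatchOnly(Map<String, List<String>> groups, String[] agents) {
        for (String agent : agents) {
            List<String> rules = groups.get(agent.toLowerCase());
            if (rules != null) {
                return rules;
            }
        }
        return new ArrayList<>();
    }

    // Behaviour expected in this mail: the best match plus the "*" rules.
    static List<String> bestMatchPlusWildcard(Map<String, List<String>> groups, String[] agents) {
        List<String> merged = new ArrayList<>(bestMatchOnly(groups, agents));
        for (String rule : groups.getOrDefault("*", new ArrayList<>())) {
            if (!merged.contains(rule)) {
                merged.add(rule);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /some/rule/\n"
                + "User-agent: nutch\nDisallow: /some/other/rule/\n";
        String[] agents = {"nutch", "*"}; // mirrors http.robots.agents=nutch,*
        Map<String, List<String>> groups = parse(robotsTxt);
        System.out.println(bestMatchOnly(groups, agents));         // [/some/other/rule/]
        System.out.println(bestMatchPlusWildcard(groups, agents)); // [/some/other/rule/, /some/rule/]
    }
}
```

With agents set to nutch,* the first method drops /some/rule/, which is the behaviour reported here; the second keeps both paths.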


Is this a known bug, and if so, is there a workaround or fix?
Thanks for any help!

Regards,

Michael


