That's probably why you should put the * in the first position in the
config file (see the comment there):
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
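For illustration only, using the agent names from the example further down
(substitute your own http.agent.name value), the ordering suggested above
would then look like:
http.robots.agents=*,nutch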
On 01.08.2007 at 18:07, Michael Böckling
<[EMAIL PROTECTED]> wrote:
Hi!
I experimented with the robots.txt parser component of Nutch, and it
seems that it does not work as it should. The call to
RobotRulesParser.getRobotRulesSet() returns only the entry with the
highest precedence, which depends on the order of the values in the
"http.robots.agents" configuration directive.
Here's an example:
robots.txt:
User-agent: *
Disallow: /some/rule/
User-agent: nutch
Disallow: /some/other/rule/
configuration:
http.robots.agents=nutch,*
==> the ruleset for "User-agent: *" is ignored
Expected behaviour: the "*" rules should be applied in every case.
Reason: the parser only returns "bestRulesSoFar" (the actual name of the
variable).
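For illustration, here is a minimal, standalone Java sketch of that
"best match only" selection (not Nutch's actual code; the class, method,
and agent names are made up for the example). It shows how keeping only
the highest-precedence ruleset causes a matching "*" block to be dropped:

import java.util.*;

public class BestMatchSketch {

    // Return only the ruleset of the highest-precedence agent found,
    // mirroring the "bestRulesSoFar" behaviour described above.
    static List<String> selectRules(Map<String, List<String>> blocks,
                                    List<String> agentPrecedence) {
        List<String> bestRulesSoFar = null;
        int bestPrecedenceSoFar = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> block : blocks.entrySet()) {
            int precedence = agentPrecedence.indexOf(block.getKey());
            if (precedence >= 0 && precedence < bestPrecedenceSoFar) {
                bestPrecedenceSoFar = precedence;
                bestRulesSoFar = block.getValue(); // previous match is dropped
            }
        }
        return bestRulesSoFar;
    }

    public static void main(String[] args) {
        // robots.txt blocks from the example above
        Map<String, List<String>> robotsTxt = new LinkedHashMap<>();
        robotsTxt.put("*", List.of("/some/rule/"));
        robotsTxt.put("nutch", List.of("/some/other/rule/"));

        // http.robots.agents=nutch,*  ==> only the "nutch" block survives
        System.out.println(selectRules(robotsTxt, List.of("nutch", "*")));
    }
}

With this selection scheme, applying the "*" rules in every case would
require merging them into the more specific ruleset rather than choosing
one or the other; whether Nutch offers such a merge is exactly the
question below.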
Is this bug known, and if so, is there a workaround or fix?
Thanks for any help!
Regards,
Michael