That's probably why you should put the * in the first position in the config file (see the comment there).

  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
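
For reference, this property is normally overridden in nutch-site.xml. The following is only a sketch, using the agent list from the example below (the value is illustrative, not a recommendation on ordering):

  <property>
    <name>http.robots.agents</name>
    <value>nutch,*</value>
  </property>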

On 01.08.2007, 18:07, Michael Böckling <[EMAIL PROTECTED]> wrote:

Hi!

I experimented with the robots.txt parser component of Nutch, and it seems that it does not work as it should. The call to RobotRulesParser.getRobotRulesSet() returns only the entry with the highest precedence, which depends on the order of the values in the "http.robots.agents" configuration directive.

Here's an example:


robots.txt:
User-agent: *
Disallow: /some/rule/
User-agent: nutch
Disallow: /some/other/rule/

configuration:
http.robots.agents=nutch,*

==> the ruleset for "User-agent: *" is ignored


Expected behaviour: the "*" rules should be applied in every case.

Reason: the parser only returns "bestRulesSoFar" (the actual name of the variable), so the lower-precedence "*" rule set is discarded.
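
For illustration, here is a minimal sketch of that kind of precedence selection. The pickRules method and the RobotRuleSet holder below are hypothetical stand-ins, not the actual Nutch source; the sketch only shows why keeping a single best match drops the "*" block instead of merging it:

  import java.util.List;
  import java.util.Map;

  class RobotsPrecedenceSketch {

      // Hypothetical holder for the Disallow lines of one User-agent block.
      static class RobotRuleSet {
          final List<String> disallows;
          RobotRuleSet(List<String> disallows) { this.disallows = disallows; }
      }

      // agentPrecedence: names from http.robots.agents, highest precedence first,
      //                  e.g. ["nutch", "*"]
      // blocks:          User-agent token -> parsed rule set from robots.txt
      static RobotRuleSet pickRules(List<String> agentPrecedence,
                                    Map<String, RobotRuleSet> blocks) {
          RobotRuleSet bestRulesSoFar = null;
          int bestPrecedenceSoFar = Integer.MAX_VALUE;
          for (Map.Entry<String, RobotRuleSet> e : blocks.entrySet()) {
              int precedence = agentPrecedence.indexOf(e.getKey());
              if (precedence >= 0 && precedence < bestPrecedenceSoFar) {
                  bestPrecedenceSoFar = precedence;
                  bestRulesSoFar = e.getValue();   // a better match replaces, never merges
              }
          }
          // With blocks for both "nutch" and "*", only the "nutch" rules survive here;
          // the "*" Disallow lines are dropped, which is the behaviour described above.
          return bestRulesSoFar;
      }
  }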


Is this bug known, and if so, is there a workaround or fix?

Thanks for any help!

Regards,

Michael




