Hi:

I am unable to get the attached patch via mail. It's better if you
create a JIRA issue and attach the patch there.

Thank you.

On 2/15/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
Hi,

There seem to be two small bugs in lib-http's RobotRulesParser.

The first is about reading Crawl-delay. The code doesn't check addRules,
so the nutch bot will pick up the crawl-delay value of another robot
from robots.txt. Let me try to be more clear:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow:


Given such a robots.txt file, the nutch bot will get 3600 as its
crawl-delay value, no matter what the nutch bot's name actually is.
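
To make the problem concrete, here is a minimal sketch of a parser loop
with the missing guard added, so that Crawl-delay is only recorded while
inside a matching User-agent block. This is an illustration, not the
actual Nutch code: the names parseCrawlDelay and addRules, and the
simplification to just two directives, are assumptions.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class CrawlDelaySketch {

  // Returns the crawl delay (in ms) that applies to agentName, or -1
  // if robots.txt specifies none for it. (Hypothetical helper.)
  public static long parseCrawlDelay(String robotsTxt, String agentName)
      throws IOException {
    BufferedReader in = new BufferedReader(new StringReader(robotsTxt));
    boolean addRules = false; // true while in a User-agent block matching us
    long crawlDelay = -1;
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
        String agent = line.substring(11).trim();
        addRules = agent.equals("*") || agent.equalsIgnoreCase(agentName);
      } else if (line.regionMatches(true, 0, "Crawl-delay:", 0, 12)) {
        // The fix: without this addRules check, foobot's 3600 would be
        // applied to every robot, including nutch.
        if (addRules) {
          try {
            crawlDelay = Long.parseLong(line.substring(12).trim()) * 1000L;
          } catch (NumberFormatException e) {
            // Malformed value: leave crawlDelay unchanged.
          }
        }
      }
    }
    return crawlDelay;
  }

  public static void main(String[] args) throws IOException {
    String robots = "User-agent: foobot\nCrawl-delay: 3600\n\n"
        + "User-agent: *\nDisallow:\n";
    // Prints -1: nutch is not foobot, so foobot's delay is ignored.
    System.out.println(parseCrawlDelay(robots, "nutch"));
  }
}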

The second is about the main method. RobotRulesParser.main advertises its
usage as "<robots-file> <url-file> <agent-name>+", but if you give it more
than one agent name it refuses them.
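
Again as an illustration only (the variable names and the exact check are
guesses, not the real main), the fix is to reject fewer than three
arguments rather than anything other than exactly three:

import java.util.Arrays;

public class UsageSketch {
  public static void main(String[] argv) {
    // "<agent-name>+" means one or more names, so require at least
    // three arguments instead of exactly three.
    if (argv.length < 3) {
      System.err.println(
          "Usage: RobotRulesParser <robots-file> <url-file> <agent-name>+");
      System.exit(-1);
    }
    String robotsFile = argv[0];
    String urlFile = argv[1];
    // Everything after the first two arguments is an agent name.
    String[] agentNames = Arrays.copyOfRange(argv, 2, argv.length);
    System.out.println("Checking " + urlFile + " against " + robotsFile
        + " for agents " + Arrays.toString(agentNames));
  }
}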

Trivial patch attached.

--
Doğacan Güney

