Emmanuel wrote:
Yes, I'm using trunk.

I think I found my problem. It works perfectly with one thread per host,
but if you set 2 threads per host, it doesn't wait for crawlDelay. I've
configured my Nutch to use 2 threads per host, which is why I had this issue.

In the code I can find:
nextFetchTime.set(endTime + (maxThreads > 1 ? minCrawlDelay : crawlDelay));
and
this.minCrawlDelay = (long) (conf.getFloat("fetcher.server.min.delay", 0.0f) * 1000);

But fetcher.server.min.delay is not defined in nutch-default.xml, so
minCrawlDelay = 0 seconds. It keeps crawling without waiting.

However, I'm wondering why we have 2 delays (minCrawlDelay and crawlDelay)
and why minCrawlDelay is set to 0. Is it a bug?

Could you please help me understand?

First, please see the discussion here: http://issues.apache.org/jira/browse/NUTCH-385

The crawlDelay value is set from the default value configured in the config files, and then adjusted to a different value if one is specified in the host's robots.txt. However, when more than one thread per host is allowed, this limit doesn't apply, as explained in NUTCH-385.

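The fetcher.server.min.delay value exists precisely to prevent overly rapid crawling when you set the number of threads per host to more than 1. If you don't define it, it defaults to 0, which is why you see no waiting between requests.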
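
To make the interaction concrete, here is a minimal standalone sketch of the delay selection quoted above. It is not the actual Fetcher code: the class name, the Map used in place of the Nutch configuration object, the helper getFloat, and the 5000 ms crawlDelay are all assumptions made for illustration. Only the two expressions (the maxThreads > 1 choice and the fetcher.server.min.delay lookup with its 0.0f default) come from the code you quoted.

```java
import java.util.HashMap;
import java.util.Map;

class CrawlDelaySketch {

    // Hypothetical stand-in for conf.getFloat(name, defaultValue);
    // a plain Map replaces the real Nutch configuration object.
    static float getFloat(Map<String, String> conf, String name, float defaultValue) {
        String value = conf.get(name);
        return value == null ? defaultValue : Float.parseFloat(value);
    }

    // Same expression as in the quoted code: seconds in the config, milliseconds in memory.
    static long minCrawlDelayMillis(Map<String, String> conf) {
        return (long) (getFloat(conf, "fetcher.server.min.delay", 0.0f) * 1000);
    }

    // Same selection as in the quoted code: with more than one thread per host,
    // only minCrawlDelay is applied and crawlDelay is ignored.
    static long nextFetchTime(long endTime, int maxThreads, long minCrawlDelay, long crawlDelay) {
        return endTime + (maxThreads > 1 ? minCrawlDelay : crawlDelay);
    }

    public static void main(String[] args) {
        long endTime = 1000L;     // arbitrary time at which the previous fetch ended
        long crawlDelay = 5000L;  // assumed value, for illustration only

        Map<String, String> conf = new HashMap<>();  // property not defined
        long unset = minCrawlDelayMillis(conf);
        System.out.println("1 thread, min delay unset:  " + nextFetchTime(endTime, 1, unset, crawlDelay));
        System.out.println("2 threads, min delay unset: " + nextFetchTime(endTime, 2, unset, crawlDelay));

        conf.put("fetcher.server.min.delay", "0.5");  // property defined as 0.5 seconds
        long set = minCrawlDelayMillis(conf);
        System.out.println("2 threads, min delay 0.5s:  " + nextFetchTime(endTime, 2, set, crawlDelay));
    }
}
```

With the property left undefined and 2 threads per host, the next fetch time equals the end time of the previous fetch, which matches the behaviour you observed. Defining fetcher.server.min.delay in your own configuration restores a pause between requests to the same host.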


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
