Emmanuel wrote:
Yes, I'm using trunk. I think I found my problem. It works perfectly with one thread per host, but if you set 2 threads per host, it doesn't wait for crawlDelay. I had configured Nutch to use 2 threads per host, which is why I hit this issue. In the code I can find nextFetchTime.set(endTime + (maxThreads > 1 ? minCrawlDelay : crawlDelay)); and this.minCrawlDelay = (long) (conf.getFloat("fetcher.server.min.delay", 0.0f) * 1000); But fetcher.server.min.delay is not defined in nutch-default.xml, so minCrawlDelay = 0 seconds and it keeps crawling without waiting. However, I'm wondering why we have 2 delays (minCrawlDelay and crawlDelay) and why minCrawlDelay is set to 0. Is it a bug? Could you please help me understand?
First, please see the discussion here: http://issues.apache.org/jira/browse/NUTCH-385
The crawlDelay value is set from the default configured in the config files, and then overridden if the host's robots.txt specifies a different value. However, when more than one thread per host is allowed, this limit doesn't apply, as explained in NUTCH-385.
The fetcher.server.min.delay value exists precisely to prevent overly rapid crawling when you set the number of threads per host to more than 1.
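
To make the behaviour concrete, here is a minimal sketch of the delay selection, built only from the two lines you quoted. The class and method names (FetchDelaySketch, nextFetchTime, toMillis) are made up for illustration and are not part of Nutch, and the 5-second crawl delay is just an assumed example value.

```java
// Illustrative only: not the Nutch fetcher, just the quoted expression in isolation.
class FetchDelaySketch {

    // Same seconds-to-milliseconds conversion as in the quoted config lookup.
    static long toMillis(float seconds) {
        return (long) (seconds * 1000);
    }

    // With more than one thread per host, the minimum delay is used
    // instead of the crawl delay, as in the quoted nextFetchTime.set(...) line.
    static long nextFetchTime(long endTime, int maxThreads,
                              long crawlDelayMs, long minCrawlDelayMs) {
        return endTime + (maxThreads > 1 ? minCrawlDelayMs : crawlDelayMs);
    }

    public static void main(String[] args) {
        long endTime = 1000L;                 // time the previous fetch finished (ms)
        long crawlDelay = toMillis(5.0f);     // assumed example crawl delay
        long minDelayUnset = toMillis(0.0f);  // fetcher.server.min.delay left at its 0.0 fallback
        long minDelaySet = toMillis(0.5f);    // assumed example value if the property is set

        System.out.println(nextFetchTime(endTime, 1, crawlDelay, minDelayUnset)); // one thread: crawlDelay applies
        System.out.println(nextFetchTime(endTime, 2, crawlDelay, minDelayUnset)); // two threads: no wait at all
        System.out.println(nextFetchTime(endTime, 2, crawlDelay, minDelaySet));   // two threads: min delay applies
    }
}
```

With two threads per host and the property unset, the next fetch time equals the end time of the previous fetch, which is the behaviour you observed. Setting fetcher.server.min.delay to a non-zero value in your configuration restores a wait between requests.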
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
