I don't see immediately what the best solution to this problem would be. One way to solve it would be for the CrawlTool to automatically adjust the "fetcher.threads.per.host" value so that it follows the formula "fetcher.threads / number of urls in the urllist". I did this manually in the config file and on the command line to the CrawlTool, so you can do the same to avoid this problem for now.
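For example, just to illustrate the formula with made-up numbers (the class name and values below are hypothetical):

public class ThreadsPerHostExample {
  public static void main(String[] args) {
    int fetcherThreads = 10;   // value of fetcher.threads
    int urlsInList = 3;        // number of urls in the urllist
    // fetcher.threads / number of urls in the urllist, but never below 1
    int threadsPerHost = Math.max(1, fetcherThreads / urlsInList);
    System.out.println("fetcher.threads.per.host = " + threadsPerHost);
  }
}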
Any comments?
We should not set fetcher.threads.per.host above one by default, since that makes the fetcher impolite, ignoring fetcher.server.delay.
A better way to fix this might be to add per-host queues of urls to be fetched, waiting for a thread. When a maximum number of urls are queued (globally? per-host?), then the fetcher can either: (a) pause until the queues diminish; or (b) discard some urls. When the fetchlist is exhausted, the fetcher may decide to drop all queued urls, or patiently wait until all queues are exhausted. The current implementation is effectively (b) where threads are used as crude queues. The old RequestScheduler fetcher implementation used queues as described here, but it had bugs and would sometimes lock up.
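Roughly, the queueing idea looks something like the following. This is only a sketch, not the old RequestScheduler code; all of the names and the cap value are made up:

import java.util.*;

/** Urls wait in per-host queues; a global cap triggers either pause or discard. */
public class HostQueues {
  private final Map<String, Deque<String>> queues = new HashMap<>();
  private int queued = 0;
  private final int maxQueued;
  private final boolean dropWhenFull;          // false = (a) pause, true = (b) discard

  public HostQueues(int maxQueued, boolean dropWhenFull) {
    this.maxQueued = maxQueued;
    this.dropWhenFull = dropWhenFull;
  }

  /** Returns false if the url was discarded because the backlog is full. */
  public synchronized boolean offer(String host, String url)
      throws InterruptedException {
    while (queued >= maxQueued) {
      if (dropWhenFull) return false;          // (b) discard some urls
      wait();                                  // (a) pause until the queues diminish
    }
    queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    queued++;
    return true;
  }

  /** Called by a fetcher thread once the host's server delay has elapsed. */
  public synchronized String poll(String host) {
    Deque<String> q = queues.get(host);
    if (q == null || q.isEmpty()) return null;
    queued--;
    notifyAll();                               // wake a producer waiting in offer()
    return q.poll();
  }
}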
It comes down to three modes of fetcher operation:
1. impolite but fast;
2. polite but slow, fetching everything;
3. polite but fast, dropping urls from sites that are slow and/or have many pages to fetch.
Currently, mode (1) is selected by setting fetcher.server.delay to something small and fetcher.threads.per.host to something greater than one. Mode (2) is selected by setting fetcher.max.delays to something very large. Mode (3) is selected by setting fetcher.max.delays to something small.
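Concretely, the three modes map to settings something like this. The property names are the ones discussed above; the values are only illustrative:

import java.util.Map;

public class FetcherModes {
  // (1) impolite but fast: small delay, more than one thread per host
  static final Map<String, String> IMPOLITE_FAST = Map.of(
      "fetcher.server.delay", "0.5",
      "fetcher.threads.per.host", "10");

  // (2) polite but slow: one thread per host, wait however long it takes
  static final Map<String, String> POLITE_SLOW = Map.of(
      "fetcher.threads.per.host", "1",
      "fetcher.max.delays", "1000000");

  // (3) polite but fast: one thread per host, give up quickly on slow or large hosts
  static final Map<String, String> POLITE_FAST = Map.of(
      "fetcher.threads.per.host", "1",
      "fetcher.max.delays", "3");
}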
The fetcher must always be polite by default. Folks may configure it otherwise for intranet crawling of their own servers but should never crawl the open web with an impolite Nutch-based crawler.
Doug
