[email protected] wrote:
The list contains at least several thousand unique hosts. Does FetcherThread pick a URL from the fetchlist at random, or does it choose alphabetically? Some sites contain 30 or so links to their own domain, so it wouldn't be surprising if my threads were blocked when FetcherThread picks alphabetically. If that is the case, is there a way to make it pick randomly?
The fetchlist is produced by Generator, which attempts to put URLs from the same host as far apart as possible. The fetchlist is then processed in sequence, so the fetcher should end up accessing URLs from different hosts one after another.
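To illustrate what "as far apart as possible" can mean, here is a minimal sketch that interleaves URLs round-robin by host. It is not Generator's actual code; the class HostInterleaver and its method are hypothetical names made up for this example.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: spreads URLs from the same host
// apart by taking one URL per host in each round.
class HostInterleaver {
    static List<String> interleave(List<String> urls) {
        // Group URLs by host, keeping first-seen host order.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = ""; // URLs without a host share one bucket
            }
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        // Emit one URL per host per round until every bucket is drained.
        List<String> out = new ArrayList<>(urls.size());
        while (!byHost.isEmpty()) {
            Iterator<Deque<String>> it = byHost.values().iterator();
            while (it.hasNext()) {
                Deque<String> queue = it.next();
                out.add(queue.poll());
                if (queue.isEmpty()) {
                    it.remove();
                }
            }
        }
        return out;
    }
}
```

Note that with such a scheme, a host with many more URLs than the others still ends up as a run at the tail of the list, so a sequential consumer can stall on that host's crawl delay near the end.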
Another thing to check: you can add some logging around FetchItemQueues to report the longest queues, together with the number of items and the crawl delay on each of them.
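A rough sketch of what such a report could look like follows. QueueSnapshot and QueueReport are hypothetical classes invented for this example; they do not mirror the real fields of FetchItemQueues, so you would need to fill the snapshots from whatever the fetcher actually exposes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical snapshot of one per-host queue. The fields are assumptions
// for illustration, not the real members of Nutch's queue classes.
class QueueSnapshot {
    final String host;
    final int size;
    final long crawlDelayMs;

    QueueSnapshot(String host, int size, long crawlDelayMs) {
        this.host = host;
        this.size = size;
        this.crawlDelayMs = crawlDelayMs;
    }
}

// Hypothetical reporter: formats one log line for each of the n longest queues.
class QueueReport {
    static List<String> longest(List<QueueSnapshot> queues, int n) {
        // Sort a copy by queue size, largest first.
        List<QueueSnapshot> sorted = new ArrayList<>(queues);
        sorted.sort(Comparator.comparingInt((QueueSnapshot q) -> q.size).reversed());

        List<String> lines = new ArrayList<>();
        for (QueueSnapshot q : sorted.subList(0, Math.min(n, sorted.size()))) {
            lines.add("queue=" + q.host + " items=" + q.size
                    + " crawlDelayMs=" + q.crawlDelayMs);
        }
        return lines;
    }
}
```

If one or two hosts dominate the output with a long crawl delay, that would suggest the threads are waiting on those queues rather than on the ordering of the fetchlist.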
And finally, Todd Lipcon is working on an improved version of Fetcher2, which should replace both Fetcher and Fetcher2. Please see this issue for more details: https://issues.apache.org/jira/browse/NUTCH-669
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
