David Bargeron wrote:
Hi - We are running Nutch 0.9 on a 9-machine cluster. We are doing a general web crawl for text documents (html, pdf, doc, txt, etc.). We are getting about 580k documents fully indexed every 24 hours. Is this an expected level of throughput, or should it be higher? It seems low to me.
It depends on the distribution of hosts in your fetchlists - if there are few unique hosts, the fetcher spends most of its time waiting in order to obey per-host crawl-delay limits.
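If that is your situation, the relevant knobs live in conf/nutch-site.xml. A minimal sketch, assuming the standard property names from the 0.9 nutch-default.xml (the values here are illustrative - check your copy for the exact names and defaults):

  <?xml version="1.0"?>
  <configuration>
    <!-- More fetcher threads only pay off when segments contain many distinct hosts. -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>50</value>
    </property>
    <!-- Politeness delay (seconds) between successive requests to the same server. -->
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
    </property>
    <!-- Cap URLs per host in each fetchlist so a handful of big hosts
         cannot serialize a whole segment behind the per-host delay. -->
    <property>
      <name>generate.max.per.host</name>
      <value>100</value>
    </property>
  </configuration>

Note that raising the thread count alone does not help much here: threads blocked on the same host still wait out the delay, so spreading each segment across more unique hosts matters more.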
If you crawl many unique hosts, you should be able to fetch roughly 50-100 pages/sec on a single node; at that point the limits are your bandwidth, your DNS setup, and the bandwidth of the target sites.
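For comparison, working from your own numbers: 580,000 docs / 86,400 sec ≈ 6.7 docs/sec for the whole cluster, i.e. well under 1 doc/sec per node on 9 machines. That is a small fraction of the 50-100 pages/sec a single node can sustain, which is the typical signature of a fetcher sleeping on crawl-delay rather than saturating its bandwidth.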
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com    Contact: info at sigram dot com
