David Bargeron wrote:
Hi - We are running Nutch 0.9 on a 9-machine cluster. We are doing a general
web crawl for text documents (HTML, PDF, DOC, TXT, etc.). We are getting
about 580k documents fully indexed every 24 hours. Is this an expected level
of throughput, or should it be higher? It seems low to me.


It depends on the distribution of hosts in your fetchlists. If there are few unique hosts, Nutch will spend most of its time waiting, in order to obey per-host politeness limits (the configured fetch delay, or a Crawl-Delay directive from robots.txt).
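
The relevant politeness knobs live in conf/nutch-site.xml. A minimal sketch, assuming the stock property names from Nutch 0.9's nutch-default.xml (the values shown are illustrative, not recommendations; check your own defaults):

<configuration>
  <!-- Seconds to wait between successive requests to the same host;
       a Crawl-Delay in robots.txt takes precedence when present. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
  <!-- Total fetcher threads per node. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <!-- Max concurrent threads against a single host; 1 keeps you polite. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
  <!-- Cap URLs per host in each generated fetchlist, so a few large
       hosts cannot dominate a segment (-1 means no limit). -->
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
</configuration>

Note that raising thread counts does not help if the fetchlist is dominated by a few hosts; increasing host diversity, or capping URLs per host at generate time as above, usually matters more.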

If you crawl many unique hosts, you should be able to fetch ~50-100 pages/sec on a single node, depending instead on your bandwidth, your DNS setup, and the bandwidth of the target sites.
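
As a quick sanity check against your numbers, treating the 580k/day figure as pure fetch throughput (it also includes parse and index time, so this is optimistic): 580,000 pages / 86,400 s is roughly 6.7 pages/sec for the whole cluster, or about 0.75 pages/sec per node across 9 machines. That is nearly two orders of magnitude below the 50-100 pages/sec figure, which points at per-host politeness waits (too few unique hosts in the fetchlists) rather than raw capacity as the bottleneck.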


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
