David Bargeron wrote:
Hi - We are running Nutch 0.9 on a 9-machine cluster. We are doing a general
web crawl for text documents (HTML, PDF, DOC, TXT, etc.). We are getting
about 580k documents fully indexed every 24 hours. Is this an expected level
of throughput, or should it be higher? It seems low to me.


It depends on the distribution of hosts in your fetchlists. If there are few unique hosts, Nutch will spend most of its time waiting, in order to obey per-host politeness limits (the configured fetch delay, or a Crawl-Delay directive from robots.txt).
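
The relevant politeness knobs live in conf/nutch-site.xml. A minimal sketch, assuming the stock property names from Nutch 0.9's nutch-default.xml (the values shown are illustrative, not recommendations; check your own defaults):

<configuration>
  <!-- Seconds to wait between successive requests to the same host;
       a Crawl-Delay in robots.txt takes precedence when present. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
  <!-- Total fetcher threads per node. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <!-- Max concurrent threads against a single host; 1 keeps you polite. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
  <!-- Cap URLs per host in each generated fetchlist, so a few large
       hosts cannot dominate a segment (-1 means no limit). -->
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
</configuration>

Note that raising thread counts does not help if the fetchlist is dominated by a few hosts; increasing host diversity, or capping URLs per host at generate time as above, usually matters more.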

If you crawl many unique hosts, you should be able to fetch ~50-100 pages/sec on a single node, depending instead on your bandwidth, your DNS setup, and the bandwidth of the target sites.
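
As a quick sanity check against your numbers, treating the 580k/day figure as pure fetch throughput (it also includes parse and index time, so this is optimistic): 580,000 pages / 86,400 s is roughly 6.7 pages/sec for the whole cluster, or about 0.75 pages/sec per node across 9 machines. That is nearly two orders of magnitude below the 50-100 pages/sec figure, which points at per-host politeness waits (too few unique hosts in the fetchlists) rather than raw capacity as the bottleneck.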


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
