Thanks. How can I determine how many unique hosts there are in my fetchlists? And if it turns out there are not many unique hosts, can I force Nutch to favor many unique hosts?
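In case it helps anyone answer: the best I've found so far is to dump a segment with "bin/nutch readseg -dump <segment> <outdir> -nocontent -nofetch -noparse -noparsedata -noparsetext" (if I have the flags right, that leaves only the crawl_generate fetchlist entries) and then count hosts from the URL lines. A rough, untested sketch in Python 2 -- it assumes readseg writes a plain-text file named "dump" inside <outdir>, with each record carrying a line that starts with "URL::":

# count_hosts.py -- count unique hosts in a readseg dump (sketch)
import sys
from urlparse import urlparse

hosts = set()
for line in open(sys.argv[1]):            # e.g. python count_hosts.py outdir/dump
    if line.startswith('URL::'):
        url = line.split('URL::', 1)[1].strip()
        host = urlparse(url)[1]           # the host[:port] part of the URL
        if host:
            hosts.add(host)
print '%d unique hosts' % len(hosts)

And for the second part of my question: is generate.max.per.host (settable in nutch-site.xml) the right knob? My understanding is that setting it to a small positive number caps how many URLs from any one host go into a single fetchlist, which should force the generator to spread each list across more hosts.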
Thanks,
Dave

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 22, 2007 11:25 AM
To: [email protected]
Subject: Re: expected throughput

David Bargeron wrote:
> Hi - We are running Nutch 0.9 on a 9-machine cluster. We are doing a general
> web crawl for text documents (html, pdf, doc, txt, etc). We are getting
> about 580k documents fully indexed every 24 hours. Is this an expected level
> of throughput, or should it be higher? It seems low to me.

It depends on the distribution of hosts in your fetchlists - if there are
few unique hosts, Nutch will wait most of the time in order to obey
crawl-delay limitations. If you crawl many unique hosts, you should be able
to fetch ~50-100 pages/sec on a single node, depending this time on your
bandwidth, the DNS setup, and the bandwidth of the target sites.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
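P.S. Just to check my understanding of the crawl-delay math: with H unique hosts in a fetchlist and a per-host politeness delay of d seconds (fetcher.server.delay, which I believe defaults to 5.0 in nutch-default.xml, unless a site's robots.txt crawl-delay says otherwise), the whole crawl is capped at roughly H/d pages per second, regardless of how many threads or machines we throw at it. Back-of-the-envelope, in the same Python 2 as above:

# Politeness ceiling (sketch): one request per host every delay_s seconds
# means H hosts cap the crawl at about H / delay_s pages per second.
def max_pages_per_sec(unique_hosts, delay_s=5.0):
    return unique_hosts / delay_s

print max_pages_per_sec(100)      # 100 hosts -> 20.0 pages/sec, cluster-wide
print 580000 / (24 * 3600.0)      # our ~580k docs/day is only ~6.7 pages/sec

If that reasoning holds, our current throughput would be consistent with only a few dozen effectively unique hosts per fetchlist, which is why I'm asking.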
