Thanks. How can I determine how many unique hosts there are in my fetchlists? And if it turns out there are not many unique hosts, can I force Nutch to favor many unique hosts?
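In case it helps anyone answer: the best I've found so far is to dump a segment with "bin/nutch readseg -dump <segment> <outdir> -nocontent -nofetch -noparse -noparsedata -noparsetext" (if I have the flags right, that leaves only the crawl_generate fetchlist entries) and then count hosts from the URL lines. A rough, untested sketch in Python 2 -- it assumes readseg writes a plain-text file named "dump" inside <outdir>, with each record carrying a line that starts with "URL::":

# count_hosts.py -- count unique hosts in a readseg dump (sketch)
import sys
from urlparse import urlparse

hosts = set()
for line in open(sys.argv[1]):            # e.g. python count_hosts.py outdir/dump
    if line.startswith('URL::'):
        url = line.split('URL::', 1)[1].strip()
        host = urlparse(url)[1]           # the host[:port] part of the URL
        if host:
            hosts.add(host)
print '%d unique hosts' % len(hosts)

And for the second part of my question: is generate.max.per.host (settable in nutch-site.xml) the right knob? My understanding is that setting it to a small positive number caps how many URLs from any one host go into a single fetchlist, which should force the generator to spread each list across more hosts.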
Thanks,
Dave

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 22, 2007 11:25 AM
To: [email protected]
Subject: Re: expected throughput

David Bargeron wrote:
> Hi - We are running Nutch 0.9 on a 9-machine cluster. We are doing a general
> web crawl for text documents (html, pdf, doc, txt, etc). We are getting
> about 580k documents fully indexed every 24 hours. Is this an expected level
> of throughput, or should it be higher? It seems low to me.

It depends on the distribution of hosts in your fetchlists - if there are
few unique hosts, Nutch will wait most of the time in order to obey
crawl-delay limitations. If you crawl many unique hosts, you should be able
to fetch ~50-100 pages/sec on a single node, depending this time on your
bandwidth, the DNS setup, and the bandwidth of the target sites.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
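P.S. Just to check my understanding of the crawl-delay math: with H unique hosts in a fetchlist and a per-host politeness delay of d seconds (fetcher.server.delay, which I believe defaults to 5.0 in nutch-default.xml, unless a site's robots.txt crawl-delay says otherwise), the whole crawl is capped at roughly H/d pages per second, regardless of how many threads or machines we throw at it. Back-of-the-envelope, in the same Python 2 as above:

# Politeness ceiling (sketch): one request per host every delay_s seconds
# means H hosts cap the crawl at about H / delay_s pages per second.
def max_pages_per_sec(unique_hosts, delay_s=5.0):
    return unique_hosts / delay_s

print max_pages_per_sec(100)      # 100 hosts -> 20.0 pages/sec, cluster-wide
print 580000 / (24 * 3600.0)      # our ~580k docs/day is only ~6.7 pages/sec

If that reasoning holds, our current throughput would be consistent with only a few dozen effectively unique hosts per fetchlist, which is why I'm asking.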
