Hi,

Is there any tutorial on running Nutch on a few machines? And how do I 
turn off the downloading and caching of URL content?

Thanks.
Alex.

-----Original Message-----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sun, 30 Dec 2007 11:41 pm
Subject: Re: Nutch - crashed during a large fetch, how to restart?

Josh Attenberg wrote:
> anyway, i've spent like 6 months trying to get a large crawl with nutch. $20
> to anyone who can show me how to fetch ~100 million pages, compressed, and
> allow me to access both the content (with or without tags) and the url
> graph.

First of all: 100 mln pages is not a small collection. You should be 
using DFS and distributed processing. Otherwise you run into limitations 
of I/O and memory on a single machine. The preferred setup for this 
volume would be at least 3-5 machines. You should make sure you have 
enough disk space to fit both the final content and temporary files 
(which could be twice as large as the final data files).
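
As a rough sketch of such a setup (assuming the Hadoop version bundled 
with Nutch of this era; the hostname "master", the ports and the 
replication factor below are placeholders, not values from this thread), 
conf/hadoop-site.xml on every node might look like:

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:9000</value>  <!-- the NameNode (DFS) -->
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>master:9001</value>         <!-- the JobTracker -->
    </property>
    <property>
      <name>dfs.replication</name>
      <value>2</value>                   <!-- copies of each DFS block -->
    </property>
  </configuration>

List the worker hostnames in conf/slaves and bring the cluster up with 
bin/start-all.sh before running the crawl commands.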

Then, you should crawl in smaller increments, e.g. 5-10 mln pages. You 
should use generate -topN, which limits the number of URLs per segment. 
Further, as Dennis suggested, you should change your configuration to 
avoid the regex urlfilter, which is known to cause problems; see the 
sketch below.
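
A minimal sketch of one such increment, using the standard whole-web 
crawl commands (the paths and the -topN value are examples only):

  # generate a fetchlist of at most 5 mln top-scoring URLs
  bin/nutch generate crawl/crawldb crawl/segments -topN 5000000
  s=`ls -d crawl/segments/* | tail -1`   # the segment just created
  # fetch it, then feed the newly discovered links back into the crawldb
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s

To drop the regex filter, change plugin.includes in conf/nutch-site.xml 
so that urlfilter-regex is replaced by the cheaper urlfilter-prefix 
and/or urlfilter-suffix plugins.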

If you follow these suggestions you will be able to fetch 100 mln pages.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com