Josh Attenberg wrote:
Anyway, I've spent about 6 months trying to get a large crawl going with
Nutch. $20 to anyone who can show me how to fetch ~100 million pages,
compressed, and allow me to access both the content (with or without tags)
and the URL graph.

First of all: 100 mln pages is not a small collection. You should be using DFS and distributed processing; otherwise you will run into the I/O and memory limits of a single machine. The preferred setup for this volume would be at least 3-5 machines. Also make sure you have enough disk space to hold both the final content and the temporary files (which could be twice as large as the final data files).
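Very roughly, and only as an illustration (the ~10 KB average size of a compressed fetched page is an assumed figure, not something measured on your crawl):

   100,000,000 pages x ~10 KB compressed      ~= 1 TB of final data
   temporary files, up to 2x the final size   ~= 2 TB
   --------------------------------------------------
   working space to plan for, cluster-wide    ~= 3 TB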

Then, you should crawl in smaller increments, e.g. 5-10 mln pages at a time. Use generate -topN, which limits the number of URLs per segment. Further, as Dennis suggested, you should change your configuration to avoid the regex urlfilter, which is known to cause problems (see the example below).
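For illustration only - the paths, the topN value and the exact plugin list below are assumptions; start from the plugin.includes in your own nutch-default.xml and adapt:

   # generate a segment capped at 5 mln URLs, matching the increments above
   bin/nutch generate crawl/crawldb crawl/segments -topN 5000000

   <!-- in conf/nutch-site.xml: drop urlfilter-regex from plugin.includes
        and use the cheaper prefix/suffix filters instead -->
   <property>
     <name>plugin.includes</name>
     <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic</value>
   </property>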

If you follow these suggestions, you will be able to fetch 100 mln pages.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
