Josh Attenberg wrote:
Anyway, I've spent about 6 months trying to get a large crawl going with
Nutch. $20 to anyone who can show me how to fetch ~100 million pages,
compressed, and allow me to access both the content (with or without tags)
and the URL graph.

First of all: 100 mln pages is not a small collection. You should be using DFS and distributed processing; otherwise you will run into the I/O and memory limits of a single machine. The preferred setup for this volume would be at least 3-5 machines. Also make sure you have enough disk space to hold both the final content and the temporary files (which could be twice as large as the final data files).
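Very roughly, and only as an illustration (the ~10 KB average size of a compressed fetched page is an assumed figure, not something measured on your crawl):

   100,000,000 pages x ~10 KB compressed      ~= 1 TB of final data
   temporary files, up to 2x the final size   ~= 2 TB
   --------------------------------------------------
   working space to plan for, cluster-wide    ~= 3 TB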

Then, you should crawl in smaller increments, e.g. 5-10 mln pages at a time. Use generate -topN, which limits the number of URLs per segment. Further, as Dennis suggested, you should change your configuration to avoid the regex urlfilter, which is known to cause problems (see the example below).
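For illustration only - the paths, the topN value and the exact plugin list below are assumptions; start from the plugin.includes in your own nutch-default.xml and adapt:

   # generate a segment capped at 5 mln URLs, matching the increments above
   bin/nutch generate crawl/crawldb crawl/segments -topN 5000000

   <!-- in conf/nutch-site.xml: drop urlfilter-regex from plugin.includes
        and use the cheaper prefix/suffix filters instead -->
   <property>
     <name>plugin.includes</name>
     <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic</value>
   </property>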

If you follow these suggestions, you will be able to fetch 100 mln pages.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
