Hi,

Is there a tutorial on running Nutch on a few machines? And how can I turn off the downloading and caching of URL content?
Thanks,
Alex

-----Original Message-----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sun, 30 Dec 2007 11:41 pm
Subject: Re: Nutch - crashed during a large fetch, how to restart?

Josh Attenberg wrote:
> Anyway, I've spent about six months trying to get a large crawl going
> with Nutch. $20 to anyone who can show me how to fetch ~100 million
> pages, compressed, and allow me to access both the content (with or
> without tags) and the URL graph.

First of all: 100 million pages is not a small collection. You should be
using DFS and distributed processing; otherwise you run into the I/O and
memory limitations of a single machine. The preferred setup for this
volume is at least 3-5 machines (a minimal Hadoop config sketch follows
below). Make sure you have enough disk space to hold both the final
content and the temporary files, which can be twice as large as the
final data files.

Then, crawl in smaller increments, e.g. 5-10 million pages at a time,
using generate -topN to limit the number of URLs per segment (see the
loop sketched below). Further, as Dennis suggested, change your
configuration to avoid the regex urlfilter, which is known to cause
problems (config sketch below).

If you follow these suggestions you will be able to fetch 100 million
pages.

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
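For the distributed setup Andrzej recommends, Nutch of that era ran on its
bundled Hadoop. A minimal sketch of conf/hadoop-site.xml, assuming a master
host named "master" and illustrative ports (neither is from the thread),
might look like:

    <?xml version="1.0"?>
    <configuration>
      <!-- DFS namenode address; "master:9000" is an example only
           (newer Hadoop releases expect an hdfs:// URI here) -->
      <property>
        <name>fs.default.name</name>
        <value>master:9000</value>
      </property>
      <!-- MapReduce jobtracker address; also illustrative -->
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
    </configuration>

List the worker hostnames in conf/slaves, format the namenode with
bin/hadoop namenode -format, and bring the cluster up with
bin/start-all.sh; the NutchHadoopTutorial page on the Nutch wiki walks
through this in detail, which also answers Alex's tutorial question.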
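One increment of the crawl cycle Andrzej describes looks roughly like the
following; the paths and the -topN value are illustrative, not from the
thread:

    # One crawl increment: generate a segment capped at 5M URLs,
    # fetch it, then merge the results back into the crawldb.
    bin/nutch generate crawl/crawldb crawl/segments -topN 5000000
    # Pick up the newest segment (on DFS, list it with
    # "bin/hadoop dfs -ls crawl/segments" instead).
    SEGMENT=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT

Repeat the cycle until the crawldb is exhausted or you reach your target
volume.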
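Avoiding the regex urlfilter means overriding plugin.includes in
conf/nutch-site.xml so that urlfilter-regex gives way to the cheaper
prefix/suffix filters. The value below is only a sketch patterned on the
stock defaults; diff it against your own nutch-default.xml before using it:

    <!-- Sketch: replaces urlfilter-regex with urlfilter-(prefix|suffix);
         the rest of the plugin list should mirror your version's defaults. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic</value>
    </property>

The prefix and suffix filters read their rules from
conf/prefix-urlfilter.txt and conf/suffix-urlfilter.txt respectively.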
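On Alex's question about turning off downloading and caching of URL
content: the pages must still be downloaded (otherwise there is nothing to
parse for outlinks), but storing the raw content in the segments can be
switched off. Assuming your Nutch version has the fetcher.store.content
and fetcher.parse properties (check conf/nutch-default.xml), a
nutch-site.xml override would be:

    <!-- Assumption: fetcher.store.content exists in this Nutch version.
         false = do not write raw page content into the segments. -->
    <property>
      <name>fetcher.store.content</name>
      <value>false</value>
    </property>
    <!-- With no stored content, parsing must happen at fetch time. -->
    <property>
      <name>fetcher.parse</name>
      <value>true</value>
    </property>

The trade-off is that you cannot re-parse later, since the raw pages are
gone; only the parsed data and the link structure remain.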
