Hello, do you have enough disk space? I noticed that Nutch downloads the content of those pages and stores it as a cached version. Try disabling caching. I fetched only a couple of pages and my data file is already about 8 MB.
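If your Nutch version lists a fetcher.store.content property in conf/nutch-default.xml (I'm not certain which releases include it), something like the following in conf/nutch-site.xml should keep the fetcher from storing the raw page content. You would then probably also need fetcher.parse set to true, since there would be no stored content left to parse afterwards. A rough sketch, not tested against your setup:

    <!-- conf/nutch-site.xml: assumes fetcher.store.content exists in your
         version; check conf/nutch-default.xml before relying on it -->
    <property>
      <name>fetcher.store.content</name>
      <value>false</value>
      <description>Do not keep the raw page content (the cached copy) in
      the segment, which keeps segment data much smaller.</description>
    </property>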
Alex.

-----Original Message-----
From: Josh Attenberg <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sat, 29 Dec 2007 5:59 pm
Subject: Re: Nutch - crashed during a large fetch, how to restart?

Anyway, I've spent something like six months trying to get a large crawl with Nutch. $20 to anyone who can show me how to fetch ~100 million pages, compressed, and allow me to access both the content (with or without tags) and the URL graph. I have PayPal!

On 12/29/07, Josh Attenberg <[EMAIL PROTECTED]> wrote:
>
> But then I will be missing many URLs, no? I need a big crawl; I wish I
> could split big fetch tasks into many smaller fetches.
>
> On Dec 28, 2007 11:57 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> > There will always be a single segments folder. Under that, if using the
> > DFS, there should be one part-xxxxx folder in each of the segment
> > sub-folders (i.e. content) for each reduce task. You can control the
> > number of URLs to fetch with the -topN option on the command line.
> >
> > Dennis Kubes
> >
> > Josh Attenberg wrote:
> > > Can anyone tell me how to control how many segments get created by
> > > generate? I tried mergesegs, but that always seems to crash.
> > >
> > > On Dec 22, 2007 12:44 PM, Josh Attenberg <[EMAIL PROTECTED]> wrote:
> > >
> > >> Regarding point #1 - in Nutch 0.7 & 0.8, how can one specify the
> > >> number of URLs that go into a segment? I guess an equivalent question
> > >> is, how to control how many segments get created by generate?
> > >>
> > >> On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> You can delete everything in segments except crawl_generate, then
> > >>> start the fetch again. Here are some hints about fetching:
> > >>>
> > >>> 1) If you can, do smaller fetches in the 1-5 million page range, then
> > >>> aggregate them together. That way, if something goes wrong, you
> > >>> haven't lost a lot.
> > >>>
> > >>> 2) Setting generate.max.per.host to some small number instead of -1
> > >>> will make fetches run faster, but won't get all pages from a single
> > >>> site. This is good if you are doing general web crawls.
> > >>>
> > >>> 3) Avoid using regex-urlfilter if possible; the prefix and suffix URL
> > >>> filters tend to work much better and don't cause stalls.
> > >>>
> > >>> 4) Setting fetcher.max.crawl.delay (default is 30 seconds) will avoid
> > >>> pages with long crawl delays. Again, this makes fetching go faster.
> > >>>
> > >>> Hope this helps.
> > >>>
> > >>> Dennis Kubes
> > >>>
> > >>> Josh Attenberg wrote:
> > >>>> I was fetching a large segment, ~30 million URLs, when something
> > >>>> weird happened and the fetcher crashed. I know I can't recover the
> > >>>> portion of what I have fetched already, but I'd like to start over.
> > >>>> When I try to do this, I get the following error:
> > >>>> already exists: segments/20071214095336/fetcher
> > >>>> What can I do to re-try this? Can I just delete that file and try
> > >>>> over? I have had nothing but trouble performing large fetches with
> > >>>> Nutch, but I can't give up! Please help!
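For the concrete retry, here is a rough command-line sketch of Dennis' advice. It assumes Nutch 0.8/0.9-style commands on the local filesystem (on the DFS, use bin/hadoop dfs -rmr instead of rm); the segment name is the one from the error above, and the crawldb/segments paths are just example paths:

    # Next time, cap the segment size (Dennis' 1-5 million page range)
    # so a crash costs less:
    bin/nutch generate crawldb segments -topN 2000000

    # To retry the crashed fetch, keep only crawl_generate in the segment
    # (the exact output directory names, e.g. the "fetcher" dir from the
    # error message, vary by version) and re-run the fetcher:
    s=segments/20071214095336
    for d in "$s"/*; do
      [ "$(basename "$d")" = crawl_generate ] || rm -rf "$d"
    done
    bin/nutch fetch "$s"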
