Hello,

Do you have enough space? I noticed that Nutch downloads the content of those pages 
and uses it as the cached version. Try to disable caching.
I fetched a couple of pages and my data file is already about 8 MB.
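
If it is the stored page content that is eating the space: I think there is a 
property along the lines of fetcher.store.content (check the exact name in your 
nutch-default.xml) that controls whether the fetcher keeps the raw content. A 
rough sketch of an override in conf/nutch-site.xml:

  <property>
    <name>fetcher.store.content</name>
    <value>false</value>   <!-- sketch only: you lose the cached-page copy -->
  </property>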

Alex.

-----Original Message-----
From: Josh Attenberg <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sat, 29 Dec 2007 5:59 pm
Subject: Re: Nutch - crashed during a large fetch, how to restart?

Anyway, I've spent about 6 months trying to get a large crawl working with Nutch.
$20 to anyone who can show me how to fetch ~100 million pages, compressed, and
allow me to access both the content (with or without tags) and the URL graph.
I have PayPal!

On 12/29/07, Josh Attenberg <[EMAIL PROTECTED]> wrote:
>
> but then I will be missing many URLs, no? I need a big crawl; I wish I
> could split big fetch tasks into many smaller fetches.
>
> On Dec 28, 2007 11:57 AM, Dennis Kubes < [EMAIL PROTECTED]> wrote:
>
> > There will always be a single segments folder.  Under that, if using the
> > DFS, there should be one part-xxxxx folder in each of the segment
> > sub-folders (i.e. content) for each reduce task.  You can control the
> > number of URLs to fetch with the -topN option on the command line.
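> >
> > For example (paths are just placeholders, adjust to your layout):
> >
> >   bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
> >
> > would limit the new segment to roughly the top-scoring 1 million URLs.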
> >
> > Dennis Kubes
> >
> > Josh Attenberg wrote:
> > > Can anyone tell me how to control how many segments get created by
> > > generate?
> > > I tried mergesegs, but that always seems to crash.
> > >
> > > On Dec 22, 2007 12:44 PM, Josh Attenberg <[EMAIL PROTECTED]> wrote:
> > >
> > >> Regarding point #1 - in Nutch 0.7 & 0.8, how can one specify the number
> > >> of URLs that go into a segment? I guess an equivalent question is: how
> > >> does one control how many segments get created by generate?
> > >>
> > >>  On Dec 19, 2007 6:23 PM, Dennis Kubes < [EMAIL PROTECTED]> wrote:
> > >>
> > >>> You can delete everything in segments except crawl_generate, then start
> > >>> the fetch again.  A rough sketch of that (paths are placeholders; on the
> > >>> DFS use the bin/hadoop dfs equivalents of plain ls/rm):
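> > >>>
> > >>>   SEG=crawl/segments/20071214095336     # the crashed segment
> > >>>   for d in $SEG/*; do
> > >>>     case "$(basename "$d")" in
> > >>>       crawl_generate) ;;                # keep the generated fetch list
> > >>>       *) rm -r "$d" ;;                  # drop the partial fetch output
> > >>>     esac
> > >>>   done
> > >>>   bin/nutch fetch "$SEG"                # then start the fetch again
> > >>>
> > >>> Beyond that, here are some hints about fetching: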
> > >>>
> > >>> 1) If you can, do smaller fetches in the 1-5 million page range, then
> > >>> aggregate them together.  That way, if something goes wrong, you haven't
> > >>> lost a lot.
> > >>>
> > >>> 2) Setting generate.max.per.host to a small number instead of -1 will
> > >>> make fetches run faster but won't get all pages from a single site.
> > >>> This is good if you are doing general web crawls.  (See the config
> > >>> sketch after this list.)
> > >>>
> > >>> 3) Avoid using regex-urlfilter if possible; the prefix and suffix url
> > >>> filters tend to work much better and don't cause stalls.
> > >>>
> > >>> 4) Setting fetcher.max.crawl.delay (default is 30 seconds) will avoid
> > >>> pages with long crawl delays.  Again, this makes fetching go faster.
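> > >>>
> > >>> For 2) and 4), something like this in conf/nutch-site.xml (values are
> > >>> only illustrative; check nutch-default.xml for the exact names and
> > >>> defaults in your version):
> > >>>
> > >>>   <property>
> > >>>     <name>generate.max.per.host</name>
> > >>>     <value>100</value>  <!-- cap URLs per host instead of -1 (unlimited) -->
> > >>>   </property>
> > >>>   <property>
> > >>>     <name>fetcher.max.crawl.delay</name>
> > >>>     <value>10</value>   <!-- skip pages asking for a longer Crawl-Delay -->
> > >>>   </property>
> > >>>
> > >>> For 3), swap urlfilter-regex for urlfilter-prefix|urlfilter-suffix in
> > >>> your existing plugin.includes value, leaving the rest of that value as-is.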
> > >>>
> > >>> Hope this helps.
> > >>>
> > >>> Dennis Kubes
> > >>>
> > >>> Josh Attenberg wrote:
> > >>>> I was fetching a large segment, ~30 million URLs, when something weird
> > >>>> happened and the fetcher crashed. I know I can't recover the portion of
> > >>>> what I have fetched already, but I'd like to start over. When I try to
> > >>>> do this, I get the following error:  already exists:
> > >>>> segments/20071214095336/fetcher
> > >>>> What can I do to re-try this? Can I just delete that file and try over?
> > >>>> I have had nothing but trouble performing large fetches with Nutch, but
> > >>>> I can't give up! Please help!
> > >>>>
> > >>
> > >
> >
>
>
