Can anyone tell me how to control how many segments get created by generate? I tried mergesegs, but that always seems to crash.
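As far as I can tell there is no direct "number of segments" knob: each run of generate produces one segment, and -topN caps how many of the top-scoring URLs go into it, so you size the segment there and loop generate/fetch/updatedb as many times as you need. A rough sketch, assuming an 0.8-style command line and the usual crawl/ directory layout (the topN value is only an example):

    # cap the new segment at roughly one million URLs;
    # crawldb and segments paths assume the usual crawl/ layout
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000000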
On Dec 22, 2007 12:44 PM, Josh Attenberg <[EMAIL PROTECTED]> wrote:

> Regarding point #1 - in nutch 7 & 8, how can one specify the number of
> urls that go into a segment? I guess an equivalent question is, how do you
> control how many segments get created by generate?
>
> On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> > You can delete everything in segments except crawl_generate, then start
> > the fetch again. Here are some hints about fetching:
> >
> > 1) If you can, do smaller fetches in the 1-5 million page range, then
> > aggregate them together. That way, if something goes wrong you haven't
> > lost a lot.
> >
> > 2) Setting generate.max.per.host to a small number instead of -1 will
> > make fetches run faster, but won't get all pages from a single site.
> > This is good if you are doing general web crawls.
> >
> > 3) Avoid using regex-urlfilter if possible; the prefix and suffix url
> > filters tend to work much better and don't cause stalls.
> >
> > 4) Setting fetcher.max.crawl.delay (default is 30 seconds) will avoid
> > pages with long crawl delays. Again, this makes fetching go faster.
> >
> > Hope this helps.
> >
> > Dennis Kubes
> >
> > Josh Attenberg wrote:
> > > I was fetching a large segment, ~30 million urls, when something weird
> > > happened and the fetcher crashed. I know I can't recover the portion of
> > > what I have fetched already, but I'd like to start over. When I try to
> > > do this, I get the following error:
> > >
> > >   already exists: segments/20071214095336/fetcher
> > >
> > > What can I do to re-try this? Can I just delete that file and try over?
> > > I have had nothing but trouble performing large fetches with Nutch, but
> > > I can't give up! Please help!
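On the "already exists: segments/20071214095336/fetcher" error: as Dennis suggests, you can keep the fetch list, remove the partial fetcher output, and re-run the fetch over the same segment. A rough sketch, assuming an 0.8-style segment layout where the fetch list lives in crawl_generate, and assuming the segment sits on the local filesystem (on HDFS the same idea applies with the hadoop dfs removal commands):

    # the timestamp is the one from the error message above
    SEG=segments/20071214095336

    ls "$SEG"   # see what the crashed fetch left behind

    # keep crawl_generate (the fetch list), drop everything else
    for d in "$SEG"/*; do
      [ "$(basename "$d")" = "crawl_generate" ] && continue
      rm -r "$d"
    done

    # re-run the fetch over the same segment
    bin/nutch fetch "$SEG"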

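For tips 2 and 4 above, the usual place to override those properties is conf/nutch-site.xml. A minimal sketch using the property names from the thread; the values are only examples and should be tuned for your crawl:

    <?xml version="1.0"?>
    <configuration>

      <!-- Tip 2: cap urls per host in a generated fetch list
           (default is -1, i.e. unlimited); example value only -->
      <property>
        <name>generate.max.per.host</name>
        <value>100</value>
      </property>

      <!-- Tip 4: skip pages whose host asks for a longer crawl delay
           than this many seconds (the thread says the default is 30);
           example value only -->
      <property>
        <name>fetcher.max.crawl.delay</name>
        <value>10</value>
      </property>

    </configuration>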