Thanks, but after trying Nutch 0.7, 0.8, and 0.9 over several months, I have never had a fetch that didn't crash, seemingly irreparably, before the third iteration.
Often the fetch completes, but the program keeps running for a long time. If I kill it, the crawl is ruined, but left alone it seems to run forever, an endless loop.

On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> You can delete everything in segments except crawl_generate, then start
> the fetch again. Here are some hints about fetching:
>
> 1) If you can, do smaller fetches in the 1-5 million page range, then
> aggregate them together. That way, if something goes wrong, you haven't
> lost a lot.
>
> 2) Setting generate.max.per.host to a small number instead of -1 will
> make fetches run faster, but won't get all pages from a single site.
> This is good if you are doing general web crawls.
>
> 3) Avoid using regex-urlfilter if possible; the prefix and suffix URL
> filters tend to work much better and don't cause stalls.
>
> 4) Setting fetcher.max.crawl.delay (default is 30 seconds) will avoid
> pages with long crawl delays. Again, this makes fetching go faster.
>
> Hope this helps.
>
> Dennis Kubes
>
> Josh Attenberg wrote:
> > I was fetching a large segment, ~30 million URLs, when something weird
> > happened and the fetcher crashed. I know I can't recover the portion I
> > have fetched already, but I'd like to start over. When I try to do this,
> > I get the following error: already exists: segments/20071214095336/fetcher
> >
> > What can I do to retry this? Can I just delete that file and try again?
> > I have had nothing but trouble performing large fetches with Nutch, but
> > I can't give up! Please help!
> >
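
Also, just to check my understanding of Dennis's first point: for the
"already exists: segments/20071214095336/fetcher" error, is the cleanup
roughly the sketch below? (This assumes the crawl lives on the local
filesystem; on HDFS I guess the deletes would go through
bin/hadoop dfs -rmr instead.)

    # segment name taken from the error message above
    cd segments/20071214095336
    # keep crawl_generate (the fetch list); delete whatever partial output
    # exists besides it, e.g. the 'fetcher' dir from the error, plus
    # crawl_fetch/content if they got created before the crash
    rm -rf fetcher crawl_fetch content crawl_parse parse_data parse_text
    cd ../..
    # then re-run the fetch against the same segment
    bin/nutch fetch segments/20071214095336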

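And for points 2-4, am I right that the overrides go in conf/nutch-site.xml,
something like the following? The two property names are the ones Dennis
mentions; the plugin.includes value is only my guess at swapping
regex-urlfilter for the prefix/suffix filters, and the 100-per-host cap is
just an example number.

    <!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
    <configuration>
      <property>
        <name>generate.max.per.host</name>
        <value>100</value>
        <description>Example cap on pages generated per host;
        -1 means unlimited.</description>
      </property>
      <property>
        <name>fetcher.max.crawl.delay</name>
        <value>30</value>
        <description>Skip pages whose robots.txt asks for a crawl delay
        longer than this many seconds (30 is the default Dennis
        mentions).</description>
      </property>
      <property>
        <name>plugin.includes</name>
        <!-- my guess: copy the default list from nutch-default.xml and
             replace urlfilter-regex with urlfilter-(prefix|suffix);
             abbreviated here -->
        <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      </property>
    </configuration>

If I read the docs right, the allowed prefixes and suffixes then go in
conf/prefix-urlfilter.txt and conf/suffix-urlfilter.txt, though I may have
those filenames wrong.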