You can delete everything in the segment except crawl_generate and then start
the fetch again.
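Roughly something like this, assuming you are running from the Nutch
directory with the data on the local filesystem (if it lives in HDFS, use
bin/hadoop dfs -rmr instead of rm -rf), and using the segment name from your
error message:

  # keep only the generated fetch list, drop the partial fetcher output
  ls segments/20071214095336 | grep -v crawl_generate | \
    xargs -I {} rm -rf segments/20071214095336/{}

  # re-run the fetch on the same segment
  bin/nutch fetch segments/20071214095336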
Here are some hints about fetching:
1) If you can, do smaller fetches in the 1-5 million page range and then
aggregate them together. That way, if something goes wrong, you haven't lost
a lot (there is a rough sketch after this list).
2) Setting generate.max.per.host to a small number instead of -1 will make
fetches run faster, but you won't get all pages from a single site. This is
good if you are doing general web crawls (there is a sample nutch-site.xml
snippet for hints 2-4 after this list).
3) Avoid using the regex url filter if possible; the prefix and suffix url
filters tend to work much better and don't cause stalls.
4) Setting fetcher.max.crawl.delay (the default is 30 seconds) will skip
pages from hosts with long crawl delays; again, this makes fetching go
faster.
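For hint 1, the generate command takes a -topN option to cap the size of
each segment; a rough sketch (the crawl/crawldb and crawl/segments paths and
the 2 million cap are just examples):

  # generate a fetch list of at most ~2 million pages
  bin/nutch generate crawl/crawldb crawl/segments -topN 2000000

  # pick up the newly generated segment (the most recent one),
  # fetch it and fold the results back into the crawldb
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment

  # repeat, then (optionally) merge the small segments into one,
  # if your Nutch version has the mergesegs command
  bin/nutch mergesegs crawl/merged_segments -dir crawl/segments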
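For hints 2 and 4, the properties can be overridden in conf/nutch-site.xml;
the values below are only illustrative:

  <!-- hint 2: cap the pages per host in each generated fetch list
       (-1 means no limit; 100 is just an example) -->
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>

  <!-- hint 4: skip pages from hosts whose robots.txt crawl delay
       is longer than this many seconds -->
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>10</value>
  </property>

For hint 3, edit your existing plugin.includes value in the same file and
swap urlfilter-regex for urlfilter-prefix and/or urlfilter-suffix, then list
the allowed prefixes/suffixes in conf/prefix-urlfilter.txt and
conf/suffix-urlfilter.txt (the default config files for those plugins).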
Hope this helps.
Dennis Kubes
Josh Attenberg wrote:
I was fetching a large segment, ~30 million urls, when something weird
happened and the fetcher crashed. I know I can't recover the portion of what
I have fetched already, but I'd like to start over. When I try to do this,
I get the following error: already exists: segments/20071214095336/fetcher
What can I do to re-try this? Can I just delete that file and try over? I
have had nothing but trouble performing large fetches with Nutch, but I
can't give up! Please help!