You can delete everything in the segment except crawl_generate and then start
the fetch again.
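Roughly something like this, assuming you are running from the Nutch
directory with the data on the local filesystem (if it lives in HDFS, use
bin/hadoop dfs -rmr instead of rm -rf), and using the segment name from your
error message:

  # keep only the generated fetch list, drop the partial fetcher output
  ls segments/20071214095336 | grep -v crawl_generate | \
    xargs -I {} rm -rf segments/20071214095336/{}

  # re-run the fetch on the same segment
  bin/nutch fetch segments/20071214095336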
Here are some hints about fetching:
1) If you can, do smaller fetches in the 1-5 million page range and then
aggregate them together. That way, if something goes wrong, you haven't lost
a lot (there is a rough sketch after this list).
2) Setting generate.max.per.host to a small number instead of -1 will make
fetches run faster, but you won't get all pages from a single site. This is
good if you are doing general web crawls (there is a sample nutch-site.xml
snippet for hints 2-4 after this list).
3) Avoid using the regex url filter if possible; the prefix and suffix url
filters tend to work much better and don't cause stalls.
4) Setting fetcher.max.crawl.delay (the default is 30 seconds) will skip
pages from hosts with long crawl delays; again, this makes fetching go
faster.
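For hint 1, the generate command takes a -topN option to cap the size of
each segment; a rough sketch (the crawl/crawldb and crawl/segments paths and
the 2 million cap are just examples):

  # generate a fetch list of at most ~2 million pages
  bin/nutch generate crawl/crawldb crawl/segments -topN 2000000

  # pick up the newly generated segment (the most recent one),
  # fetch it and fold the results back into the crawldb
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment

  # repeat, then (optionally) merge the small segments into one,
  # if your Nutch version has the mergesegs command
  bin/nutch mergesegs crawl/merged_segments -dir crawl/segments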
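For hints 2 and 4, the properties can be overridden in conf/nutch-site.xml;
the values below are only illustrative:

  <!-- hint 2: cap the pages per host in each generated fetch list
       (-1 means no limit; 100 is just an example) -->
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>

  <!-- hint 4: skip pages from hosts whose robots.txt crawl delay
       is longer than this many seconds -->
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>10</value>
  </property>

For hint 3, edit your existing plugin.includes value in the same file and
swap urlfilter-regex for urlfilter-prefix and/or urlfilter-suffix, then list
the allowed prefixes/suffixes in conf/prefix-urlfilter.txt and
conf/suffix-urlfilter.txt (the default config files for those plugins).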
Hope this helps.
Dennis Kubes
Josh Attenberg wrote:
I was fetching a large segment, ~30 million urls, when something weird
happened and the fetcher crashed. I know I can't recover the portion of what
I have fetched already, but I'd like to start over. When I try to do this,
I get the following error: already exists: segments/20071214095336/fetcher
What can I do to re-try this? Can I just delete that file and try over? I
have had nothing but trouble performing large fetches with Nutch, but I
can't give up! Please help!