There will always be a single segments folder. Under that, if you are using the DFS, each of a segment's sub-folders (e.g. content) should contain one part-xxxxx folder per reduce task. You can control the number of URLs to fetch with the -topN option on the generate command line; an example is below.
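For example, something like this (a sketch only; crawl/crawldb and crawl/segments are just the usual layout, substitute your own paths):

  # create a new segment containing at most the 100,000 top-scoring URLs;
  # -numFetchers sets how many fetch lists (part-xxxxx folders) are produced
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 4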

Dennis Kubes

Josh Attenberg wrote:
Can anyone tell me how to control how many segments get created by generate?
I tried mergesegs, but that always seems to crash.

On Dec 22, 2007 12:44 PM, Josh Attenberg <[EMAIL PROTECTED]> wrote:

regarding point #1 - in Nutch 0.7 & 0.8, how can one specify the number of
URLs that go into a segment? I guess an equivalent question is: how do I
control how many segments get created by generate?

 On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:

You can delete everything in the segment except crawl_generate and then start
the fetch again (a rough sketch of that cleanup is at the end of this
message).  Here are some hints about fetching:

1) If you can, do smaller fetches in the 1-5 million page range and then
aggregate them together.  That way, if something goes wrong, you haven't lost
a lot.

2) Setting generate.max.per.host to a small number instead of -1 will make
fetches run faster, but you won't get all pages from a single site.  This is
good if you are doing general web crawls.  (See the configuration sketch
after this list.)

3) Avoid using the regex URL filter if possible; the prefix and suffix URL
filters tend to work much better and don't cause stalls.

4) Setting fetcher.max.crawl.delay (the default is 30 seconds) will skip
pages from hosts that ask for a longer crawl delay.  Again, this makes
fetching go faster.  (Also shown in the sketch below.)
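Both of those properties (and the URL filter plugins from point 3) are
configured in conf/nutch-site.xml.  A minimal sketch, with example values
only, not recommendations:

  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>
  </property>
  <!-- for point 3, swap urlfilter-regex for urlfilter-(prefix|suffix)
       in your plugin.includes value -->

These stanzas go inside the <configuration> element.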

Hope this helps.
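On the recovery question itself, here is a rough sketch of the "delete
everything in the segment except crawl_generate" step, assuming a local
filesystem and the segment name from your error (on the DFS the equivalent
would be bin/hadoop dfs -rmr on the same paths):

  # keep crawl_generate, remove any partial fetch output in the segment
  for d in segments/20071214095336/*; do
    case "$(basename "$d")" in
      crawl_generate) ;;        # keep the generated fetch list
      *) rm -r "$d" ;;          # drop fetcher/content output etc.
    esac
  done
  # then re-run the fetch on that segment
  bin/nutch fetch segments/20071214095336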

Dennis Kubes

Josh Attenberg wrote:
I was fetching a large segment, ~30 million URLs, when something weird
happened and the fetcher crashed. I know I can't recover the portion of what
I have fetched already, but I'd like to start over. When I try to do this, I
get the following error:

  already exists: segments/20071214095336/fetcher

What can I do to re-try this? Can I just delete that file and try over? I
have had nothing but trouble performing large fetches with Nutch, but I
can't give up! Please help!


