There will always be a single segments folder. Under that, if you are using the DFS, each of a segment's sub-folders (e.g. content) should contain one part-xxxxx folder per reduce task. You can control the number of URLs to fetch with the -topN option on the generate command line; an example is below.
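For example, something like this (a sketch only; crawl/crawldb and crawl/segments are just the usual layout, substitute your own paths):

  # create a new segment containing at most the 100,000 top-scoring URLs;
  # -numFetchers sets how many fetch lists (part-xxxxx folders) are produced
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 4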

Dennis Kubes

Josh Attenberg wrote:
Can anyone tell me how to control how many segments get created by generate?
I tried mergesegs, but that always seems to crash.

On Dec 22, 2007 12:44 PM, Josh Attenberg <[EMAIL PROTECTED]> wrote:

regarding point #1 - in Nutch 0.7 & 0.8, how can one specify the number of
URLs that go into a segment? I guess an equivalent question is: how do I
control how many segments get created by generate?

 On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:

You can delete everything in the segment except crawl_generate and then start
the fetch again (a rough sketch of that cleanup is at the end of this
message).  Here are some hints about fetching:

1) If you can, do smaller fetches in the 1-5 million page range and then
aggregate them together.  That way, if something goes wrong, you haven't lost
a lot.

2) Setting generate.max.per.host to a small number instead of -1 will make
fetches run faster, but you won't get all pages from a single site.  This is
good if you are doing general web crawls.  (See the configuration sketch
after this list.)

3) Avoid using the regex URL filter if possible; the prefix and suffix URL
filters tend to work much better and don't cause stalls.

4) Setting fetcher.max.crawl.delay (the default is 30 seconds) will skip
pages from hosts that ask for a longer crawl delay.  Again, this makes
fetching go faster.  (Also shown in the sketch below.)
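Both of those properties (and the URL filter plugins from point 3) are
configured in conf/nutch-site.xml.  A minimal sketch, with example values
only, not recommendations:

  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>
  </property>
  <!-- for point 3, swap urlfilter-regex for urlfilter-(prefix|suffix)
       in your plugin.includes value -->

These stanzas go inside the <configuration> element.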

Hope this helps.
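On the recovery question itself, here is a rough sketch of the "delete
everything in the segment except crawl_generate" step, assuming a local
filesystem and the segment name from your error (on the DFS the equivalent
would be bin/hadoop dfs -rmr on the same paths):

  # keep crawl_generate, remove any partial fetch output in the segment
  for d in segments/20071214095336/*; do
    case "$(basename "$d")" in
      crawl_generate) ;;        # keep the generated fetch list
      *) rm -r "$d" ;;          # drop fetcher/content output etc.
    esac
  done
  # then re-run the fetch on that segment
  bin/nutch fetch segments/20071214095336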

Dennis Kubes

Josh Attenberg wrote:
I was fetching a large segment, ~30 million URLs, when something weird
happened and the fetcher crashed. I know I can't recover the portion of what
I have fetched already, but I'd like to start over. When I try to do this, I
get the following error:

  already exists: segments/20071214095336/fetcher

What can I do to re-try this? Can I just delete that file and try over? I
have had nothing but trouble performing large fetches with Nutch, but I
can't give up! Please help!


