But then I will be missing many URLs, no? I need a big crawl; I wish I
could split big fetch tasks into many smaller fetches.
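
Something like the loop below is what I have in mind (a rough sketch only,
assuming Nutch 0.9-style commands and made-up paths; on the DFS the segment
lookup would go through "bin/hadoop dfs -ls" rather than a local ls):

  # run several small generate/fetch/updatedb rounds instead of one huge fetch
  for round in 1 2 3 4 5; do
    # cap each fetchlist at 2M URLs with -topN
    bin/nutch generate crawl/crawldb crawl/segments -topN 2000000
    # pick the segment that generate just created (newest timestamp)
    segment=crawl/segments/`ls crawl/segments | sort | tail -1`
    bin/nutch fetch $segment
    # fold the fetch results back into the crawldb before the next round
    bin/nutch updatedb crawl/crawldb $segment
  done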

On Dec 28, 2007 11:57 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> There will always be a single segments folder.  Under that, if you are
> using the DFS, there should be one part-xxxxx folder in each of the
> segment sub-folders (e.g. content) for each reduce task.  You can control
> the number of URLs to fetch with the -topN option on the command line.
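>
> For example (paths and the segment timestamp are placeholders):
>
>   bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
>   bin/hadoop dfs -ls crawl/segments/<segment-timestamp>/content
>
> The second command should list one part-xxxxx folder (part-00000,
> part-00001, ...) per reduce task.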
>
> Dennis Kubes
>
> Josh Attenberg wrote:
> > Can anyone tell me how to control how many segments get created by
> > generate? I tried mergesegs, but that always seems to crash.
> >
> > On Dec 22, 2007 12:44 PM, Josh Attenberg <[EMAIL PROTECTED]> wrote:
> >
> >> Regarding point #1: in Nutch 0.7 & 0.8, how can one specify the number
> >> of URLs that go into a segment? I guess an equivalent question is: how
> >> does one control how many segments get created by generate?
> >>
> >>  On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>
> >>> You can delete everything in the segment except crawl_generate, then
> >>> start the fetch again.  Here are some hints about fetching:
> >>>
> >>> 1) If you can, do smaller fetches in the 1-5 million page range, then
> >>> aggregate them together.  That way, if something goes wrong, you haven't
> >>> lost a lot.
> >>>
> >>> 2) Setting generate.max.per.host to a small number instead of -1 will
> >>> make fetches run faster, but it won't get all pages from a single site.
> >>> This is good if you are doing general web crawls (see the example
> >>> settings after this list).
> >>>
> >>> 3) Avoid using regex-urlfilter if possible; the prefix and suffix URL
> >>> filters tend to work much better and don't cause stalls.
> >>>
> >>> 4) Setting fetcher.max.crawl.delay (default is 30 seconds) will skip
> >>> pages with long crawl delays.  Again, this makes fetching go faster.
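> >>>
> >>> For 2) and 4) the settings go in conf/nutch-site.xml, roughly like this
> >>> (the values are only examples, tune them for your own crawl):
> >>>
> >>>   <property>
> >>>     <name>generate.max.per.host</name>
> >>>     <value>100</value>
> >>>   </property>
> >>>   <property>
> >>>     <name>fetcher.max.crawl.delay</name>
> >>>     <value>10</value>
> >>>   </property>
> >>>
> >>> To restart a crashed fetch as in your case, removing the partial output
> >>> on the DFS would look something like
> >>>
> >>>   bin/hadoop dfs -rmr segments/20071214095336/fetcher
> >>>
> >>> repeated for any other sub-folder of that segment except crawl_generate.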
> >>>
> >>> Hope this helps.
> >>>
> >>> Dennis Kubes
> >>>
> >>> Josh Attenberg wrote:
> >>>> I was fetching a large segment, ~30 million URLs, when something weird
> >>>> happened and the fetcher crashed. I know I can't recover the portion of
> >>>> what I have fetched already, but I'd like to start over. When I try to
> >>>> do this, I get the following error: "already exists:
> >>>> segments/20071214095336/fetcher". What can I do to re-try this? Can I
> >>>> just delete that file and try over? I have had nothing but trouble
> >>>> performing large fetches with Nutch, but I can't give up! Please help!
> >>>>
> >>
> >
>
