Regarding point #1: in Nutch 0.7 and 0.8, how can one specify the number of
URLs that go into a segment? I guess an equivalent question is: how does one
control how many segments get created by generate?
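
(For what it's worth, the only knob I've found so far is the -topN argument
to generate, which as far as I understand caps how many URLs go into the
generated fetch list, e.g.

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000000

but I'm not sure whether that is the intended way, or whether it behaves
the same in both versions.)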

On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> You can delete everything in that segment directory except crawl_generate,
> then start the fetch again.
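> For the segment in your error message, that would be something like this
> (keep only crawl_generate; exact subdirectory names vary by version):
>
>   cd segments/20071214095336
>   rm -rf fetcher        # plus any other output the crashed fetch created
>
> or the hadoop dfs equivalent if your segments live on HDFS.
>
> Here are some hints about fetching: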
>
> 1) If you can, do smaller fetches in the 1-5 million page range, then
> aggregate them together.  That way, if something goes wrong, you haven't
> lost a lot.
>
> 2) Setting generate.max.per.host to a small number instead of -1 will make
> fetches run faster, but you won't get all pages from a single site.  This
> is good if you are doing general web crawls (see the nutch-site.xml sketch
> after these hints).
>
> 3) Avoid using regex-urlfilter if possible; the prefix and suffix URL
> filters tend to work much better and don't cause stalls.
>
> 4) Setting fetcher.max.crawl.delay (the default is 30 seconds) makes the
> fetcher skip pages whose servers ask for longer crawl delays than that.
> Again, this makes fetching go faster.
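>
> For hints 2 and 4, the corresponding entries in nutch-site.xml would look
> roughly like this (the values here are only examples):
>
>   <property>
>     <name>generate.max.per.host</name>
>     <!-- max urls per host in a single fetch list; -1 means no limit -->
>     <value>100</value>
>   </property>
>   <property>
>     <name>fetcher.max.crawl.delay</name>
>     <!-- skip pages whose robots.txt asks for a longer delay (seconds) -->
>     <value>30</value>
>   </property>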
>
> Hope this helps.
>
> Dennis Kubes
>
> Josh Attenberg wrote:
> > I was fetching a large segment, ~30 million URLs, when something weird
> > happened and the fetcher crashed. I know I can't recover the portion of
> > what I have fetched already, but I'd like to start over. When I try to
> > do this, I get the following error:
> >
> >   already exists: segments/20071214095336/fetcher
> >
> > What can I do to retry this? Can I just delete that file and try again?
> > I have had nothing but trouble performing large fetches with Nutch, but
> > I can't give up! Please help!
> >
>
