I am a new Nutch user and hopeful future dev, but so far I am mainly focused on learning to use Nutch before delving into the code.
I am using the Nutch 0.8.1 release under Red Hat Enterprise Linux 4, and I am curious what the effects are of running a stage of the crawl cycle more than once. I ask because several times now I have started a restricted internet crawl, only to find several days later that it has crashed for an unknown reason during the map-reduce job at the end of the fetch step. The logs do not indicate the cause of the crash, and the intermediate files (the cached pages) are lost.

I would like to restart the fetch from the last iteration, but I am worried that the partial fetch may have damaged the crawldb. Basically, I would like to know the effects of re-running the cycle (generate, fetch, and so on) when the previous iteration did not complete all the way through updating the crawldb. One iteration in my setup looks roughly like the sketch below.
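For reference, this is approximately what I run each iteration; the paths, the -topN value, and the way I pick up the newest segment are just examples following the 0.8 whole-web crawl tutorial, not my exact script:

    # one crawl iteration (crawl/crawldb and crawl/segments are example paths)
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    # pick up the segment that generate just created
    s=`ls -d crawl/segments/2* | tail -1`
    # fetch it -- this is the step that dies in its final map-reduce
    bin/nutch fetch $s
    # fold the fetch results back into the crawldb
    # (the failed iterations never get this far)
    bin/nutch updatedb crawl/crawldb $s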
Thanks,
-Charlie Williams