I am a new Nutch user and hopeful future dev, but so far I am mainly focused on learning to use Nutch before delving into the code.
I am using the Nutch 0.8.1 release under Red Hat Enterprise Linux 4, and I am curious what the effects are of running a stage of the crawl cycle more than once. I ask because several times now I have started a restricted internet crawl, only to find several days later that it has crashed for an unknown reason during the map-reduce job at the end of the fetch step. The logs do not indicate the cause of the crash, and the intermediate files (the cached pages) are lost.

I would like to restart the fetch from the last iteration, but I am worried that the partial fetch may have damaged the crawldb. Basically, I would like to know the effects of re-running the cycle (generate, fetch, and so on) when the previous iteration did not complete all the way through updating the crawldb. One iteration in my setup looks roughly like the sketch below.
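For reference, this is approximately what I run each iteration; the paths, the -topN value, and the way I pick up the newest segment are just examples following the 0.8 whole-web crawl tutorial, not my exact script:

    # one crawl iteration (crawl/crawldb and crawl/segments are example paths)
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    # pick up the segment that generate just created
    s=`ls -d crawl/segments/2* | tail -1`
    # fetch it -- this is the step that dies in its final map-reduce
    bin/nutch fetch $s
    # fold the fetch results back into the crawldb
    # (the failed iterations never get this far)
    bin/nutch updatedb crawl/crawldb $s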
Thanks,
-Charlie Williams