You can't. Crawls are self-contained. You can restart a fetch by removing all folders under the segments/xxxx/* directories except crawl_generate and then re-executing the fetch job. But there isn't a way to resume a crawl job from a mid-crawl checkpoint.
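A minimal sketch of that cleanup, assuming a segment path of your own (the `SEGMENT` value and the `mkdir` setup below are hypothetical stand-ins so the snippet is self-contained; a real crawl already has these directories):

```shell
# Hypothetical segment path -- substitute your actual segment directory.
SEGMENT=crawl/segments/20080101000000

# Demo setup only: fake the subdirectories a partially fetched
# segment would contain. Skip this against a real crawl.
mkdir -p "$SEGMENT"/crawl_generate "$SEGMENT"/crawl_fetch \
         "$SEGMENT"/content "$SEGMENT"/crawl_parse \
         "$SEGMENT"/parse_data "$SEGMENT"/parse_text

# Remove every subdirectory except crawl_generate, which holds
# the fetch list needed to re-run the fetch job.
for d in "$SEGMENT"/*; do
  [ "$(basename "$d")" != "crawl_generate" ] && rm -rf "$d"
done

ls "$SEGMENT"

# Then re-run the fetch and CrawlDb update for this segment
# (requires a Nutch installation, so left commented here):
# bin/nutch fetch "$SEGMENT"
# bin/nutch updatedb crawl/crawldb "$SEGMENT"
```

After the loop, `crawl_generate` is the only directory left, so the fetch job can regenerate the rest of the segment from its fetch list.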

Dennis

Sherjeel Niazi wrote:
Hi,

I am using Nutch 0.9
I am crawling a series of URLs on a website, but after some time the crawler
crashes with the following error:

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:97)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:62)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:128)

How can I resume the crawl from where it stopped?


Sherjeel
