[moving this to nutch-user]

Hrishikesh Agashe wrote:
Hi,

I am planning to do a huge crawl using Nutch (billions of URLs), so I need
to understand whether Nutch can handle restarts after a crash.

For a single system, if I press Ctrl+C while Nutch is running and then restart
it, will Nutch be able to detect how far it got in the last run and continue
from that point? Or will it be treated as a completely fresh crawl?

Nutch does not try to resume an action that was interrupted.
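
Since an interrupted step has to be rerun, a common approach for a large crawl is to drive the inject/generate/fetch/updatedb cycle from a small wrapper that records which steps have completed, so that after a crash you restart from the last finished step instead of redoing the whole crawl. Below is a minimal sketch; the checkpoint file, the directory layout (urls, crawl/crawldb, crawl/segments) and NUTCH_HOME are my own illustrative assumptions, not anything Nutch provides:

#!/usr/bin/env python
# Sketch of a restartable crawl driver: each Nutch step runs as a separate
# command, and a checkpoint file records the steps that finished successfully.
import os
import subprocess

NUTCH = os.path.join(os.environ.get("NUTCH_HOME", "."), "bin/nutch")
CHECKPOINT = "crawl.checkpoint"            # one completed step name per line
CRAWLDB, SEGMENTS = "crawl/crawldb", "crawl/segments"

def done_steps():
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return set(line.strip() for line in f)

def run_step(name, args):
    # Run one Nutch command unless the checkpoint says it already finished.
    if name in done_steps():
        print("skipping %s (already completed)" % name)
        return
    subprocess.check_call([NUTCH] + args)   # raises if the step fails
    with open(CHECKPOINT, "a") as f:
        f.write(name + "\n")

# One generate/fetch/updatedb round; rerunning the script after a crash
# repeats only the step that did not complete.
run_step("inject", ["inject", CRAWLDB, "urls"])
run_step("generate-1", ["generate", CRAWLDB, SEGMENTS])
segment = os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])
run_step("fetch-1", ["fetch", segment])
run_step("updatedb-1", ["updatedb", CRAWLDB, segment])
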
Also, if I have 5 nodes running Nutch and doing the crawling, and one of the
nodes fails, should that be treated as a total failure of Nutch itself? Or
should I let the other nodes proceed? Will I lose the data gathered by the
failed node?

Hadoop will execute the remaining tasks on the nodes that are still available. Data is usually stored on a shared/distributed filesystem (such as HDFS). If your setup is similar and you make sure the filesystem can survive single-node failures, your data should be safe.
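
For HDFS, surviving a single-node failure mostly comes down to keeping the block replication factor at 3 (the usual default) or at least above 1. A sketch of the relevant hdfs-site.xml property:

<!-- hdfs-site.xml: number of copies HDFS keeps of each block; with 3,
     losing a single datanode does not lose any data. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
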

--
Sami Siren
