[moving this to nutch-user]
Hrishikesh Agashe wrote:
I am planning to do a huge crawl using Nutch (billions of URLs) and so need
to understand whether Nutch can handle restarts after a crash.
For a single system, if I press Ctrl+C while Nutch is running and then restart
it, will Nutch be able to detect where it got to in the last run and continue
from that point? Or will it be treated as a new, fresh crawl?
Also, if I have 5 nodes running Nutch and doing the crawling, and one of the
nodes fails, should that be considered a total failure of Nutch itself? Or
should I let the other nodes proceed? Will I lose the data gathered by the
failed node?

Nutch does not try to resume the action that was interrupted.
Hadoop will re-execute the remaining tasks on the nodes that are still
available. Usually the data is stored on a shared/distributed filesystem
(such as HDFS). If your setup is similar and you ensure that the filesystem
can survive single-node failures, your data should be safe.
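Since Nutch does not resume an interrupted job, one practical approach for a large crawl is to drive each cycle as explicit per-step commands rather than a single long-running invocation, so that after a crash you only re-run the step that failed. A rough sketch, assuming a Nutch 1.x installation with crawl data under `crawl/` and a seed URL list in `urls/` (directory names here are just placeholders):

```sh
#!/bin/sh
# One crawl cycle broken into discrete steps. If a step is killed,
# remove that step's partial output and rerun only that step; the
# crawldb is not modified until updatedb runs.

bin/nutch inject crawl/crawldb urls                       # seed the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 100000

SEGMENT=$(ls -d crawl/segments/* | tail -1)               # newest segment

bin/nutch fetch "$SEGMENT"      # if interrupted: delete $SEGMENT, re-generate
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"               # commit results to crawldb
```

The point is that each step's output is isolated in its own segment directory, so a crash mid-fetch costs you at most that one segment, not the whole crawl.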