Re: Restarting Nutch

Sami Siren Wed, 18 Feb 2009 05:36:17 -0800

[moving this to nutch-user]

Hrishikesh Agashe wrote:

Hi,

I am planning to do a huge crawl using Nutch (billions of URLs) and so need
to understand whether Nutch can handle restarts after a crash.

On a single system, if I press Ctrl+C while Nutch is running and then restart
it, can Nutch detect how far it got in the last run and continue from that
point? Or will it be treated as a fresh crawl?

Nutch does not try to resume the action that was interrupted.

Also, if I have 5 nodes running Nutch and doing the crawling, and one of the
nodes fails, should that be considered a total failure of Nutch itself? Or
should I allow the other nodes to proceed? Will I lose the data gathered by
the failed node?

Hadoop will execute the remaining tasks at nodes that are available. Usually
data will be stored on a shared/distributed filesystem (like HDFS). If your
setup is similar and you ensure that the filesystem can survive single node
failures, your data should be safe.
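
To illustrate the idea of remaining tasks being executed on the nodes that are still available, here is a toy model in Python. It is only a sketch and not Hadoop's actual scheduler: the function name run_with_failover, the round-robin assignment, and the node names are all assumptions made up for the example.

```python
from collections import deque


def run_with_failover(tasks, nodes, failed_node):
    """Toy model (not Hadoop's scheduler): spread tasks over nodes
    round-robin, then hand the failed node's tasks to the survivors."""
    survivors = [n for n in nodes if n != failed_node]
    if not survivors:
        raise RuntimeError("no nodes left to run the remaining tasks")

    # Initial round-robin assignment across all nodes.
    assignment = {node: [] for node in nodes}
    for i, task in enumerate(tasks):
        assignment[nodes[i % len(nodes)]].append(task)

    # The failed node's tasks are treated as unfinished and re-queued.
    pending = deque(assignment.pop(failed_node, []))
    i = 0
    while pending:
        assignment[survivors[i % len(survivors)]].append(pending.popleft())
        i += 1
    return assignment


if __name__ == "__main__":
    result = run_with_failover(list(range(10)),
                               ["n1", "n2", "n3", "n4", "n5"], "n3")
    for node, done in sorted(result.items()):
        print(node, done)
```

In this model every task still ends up on some surviving node, which mirrors the point above: a single node failure does not have to fail the whole job, as long as the data the tasks read and write lives on a filesystem that survives the failure.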


--
Sami Siren
