On Oct 18, 2007, at 8:05 PM, Ming Yang wrote:
In the original MapReduce paper from Google, it is mentioned that healthy workers can take over failed tasks from other workers. Does Hadoop have the same failure recovery strategy?
Yes. If a task fails on one node, it is assigned to another free node automatically.
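The number of attempts before the whole job is declared failed is also configurable per job. A minimal sketch, assuming your release exposes the retry setters on JobConf (the class name here is illustrative, not part of Hadoop):

    import org.apache.hadoop.mapred.JobConf;

    public class RetryConfig {
      public static JobConf newConf() {
        JobConf conf = new JobConf();
        // A failed map task is rescheduled on another node, up to
        // this many attempts, before the job itself is failed.
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);
        return conf;
      }
    }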
My other question is: in the paper, it seems that nodes can be added or removed while the cluster is running jobs. How does Hadoop achieve this? The slave locations are saved in a file, so the master doesn't know about new nodes until it restarts and reloads the slave list.
The slaves file is only used by the startup scripts when bringing up the cluster. If additional data nodes or task trackers (i.e., slaves) are started, they automatically join the cluster and will be given work. If the daemons on one of the slaves are killed, their work will be redone on other nodes.
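For example, assuming the usual Hadoop tarball layout on the new machine and a config pointing at the running master, you can bring the daemons up by hand and they will register themselves:

    # Run on the newly added slave (paths are illustrative):
    $ bin/hadoop-daemon.sh start datanode      # joins the namenode
    $ bin/hadoop-daemon.sh start tasktracker   # joins the jobtracker and gets work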
-- Owen
