[ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719980#action_12719980 ]
Sharad Agarwal commented on HADOOP-4586:
----------------------------------------

If I understand correctly, the current patch doesn't share the state between master and slaves. It relies on HADOOP-3245 for keeping the state. I assume that for this to work, the state has to be kept on HDFS instead of the local filesystem. If a new master is elected, the jobtracker is started using the state from HDFS, right?
Also, having each node read the master info from HDFS at frequent intervals may not scale well. I think ZooKeeper would be better suited to the case where we are just doing master election and keeping watch on master changes.

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, HADOOP-4586v0.3.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an effort to enhance
> performance, with a single JobTracker (master node). Its responsibilities
> range from managing the job submission process and computing the input
> splits to scheduling tasks on the slave nodes (TaskTrackers) and
> monitoring their health.
> In some environments, like the IBM and Google's Internet-scale computing
> initiative, there is a need for high availability, and performance
> becomes a secondary issue. In these environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many different ways.
> In the document at:
> http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in
> my work.
> Thank you!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
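
To illustrate the master-election point raised in the comment, here is a minimal in-memory sketch in Python. It is not the ZooKeeper client API and not the approach taken in the attached patches; `ElectionRegistry` and all of its method names are hypothetical. It only mimics the general idea: each candidate registers under an increasing sequence number, the lowest live number is the master, and interested nodes are notified through a callback when the master changes instead of polling shared storage at a fixed interval. Callbacks here stay registered after firing, which is a simplification.

```python
class ElectionRegistry:
    """Hypothetical in-memory stand-in for a coordination service."""

    def __init__(self):
        self._next_seq = 0      # next sequence number to hand out
        self._candidates = {}   # sequence number -> candidate name
        self._watchers = []     # callbacks invoked when the master changes

    def master(self):
        """Return the candidate with the lowest sequence number, or None."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]

    def watch(self, callback):
        """Register a callback that receives the new master's name."""
        self._watchers.append(callback)

    def join(self, name):
        """Add a candidate and return its sequence number."""
        before = self.master()
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = name
        self._notify_if_changed(before)
        return seq

    def leave(self, seq):
        """Remove a candidate, e.g. because its process died."""
        before = self.master()
        self._candidates.pop(seq, None)
        self._notify_if_changed(before)

    def _notify_if_changed(self, before):
        after = self.master()
        if after != before:
            for callback in self._watchers:
                callback(after)


registry = ElectionRegistry()
seen = []                             # master changes observed by a slave
registry.watch(seen.append)
jt1 = registry.join("jobtracker-1")   # first candidate becomes master
jt2 = registry.join("jobtracker-2")   # standby, no notification
registry.leave(jt1)                   # simulated failure, standby takes over
print(seen)
```

In this sketch the slave is told about a master change only twice (the initial election and the failover), regardless of how long the cluster runs, whereas periodic polling costs one read per node per interval.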