[ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662309#action_12662309 ]
Francesco Salbaroli commented on HADOOP-4586:
---------------------------------------------

>Sorry for the late comments:
>For a master/slave HA solution, two main problems are:
>1. Mechanism that determines a master in a cluster during startup and failover.

The JGroups library (whose manual can be found here: [http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf]) automatically handles the election of a group coordinator. The node elected group coordinator is also the master of the cluster. In case of a failure, a new group coordinator (and, consequently, a new cluster master) will be elected.

>Handling loss of quorum,

The shared state resides entirely on HDFS (see issues HADOOP-1876 and HADOOP-3245), so for now there is no shared soft state between nodes. However, the facilities for managing a shared state are present and can be used in a future update.

>split-brain and fencing in case of split-brain.

The JGroups library tries to handle network partitions and merging automatically, but given that:
* There is no shared soft state
* There is only one access point to HDFS in the whole Hadoop cluster (the NameNode)
the network partition problem should not be an issue (only one partition at a time can access HDFS). In future versions a more elegant way of dealing with network partitions should be added.

> It also requires comprehensive management tools for configuring, managing
> and monitoring cluster.

I am now adding JMX support. After the initial testing phase I will post results and an updated version.

>2. Sharing state information between master and slave, so that a slave node
>can take over as master.
>Currently the proposed solution addresses mainly the second problem. I have not
>seen much information on how the first problem is addressed. While the sharing
>of information between master and slave can be done in many ways, managing
>the master/slave cluster is a more complicated problem. Could you please add
>more information on how the design handles these issues and some notes on how
>administrator uses this functionality to manage the cluster.

I hope I have answered your question. If you need more, feel free to contact me.

>Also analysis of the impact of job tracker performance due to the introduction
>of this feature needs to be done.

I am about to begin the testing phase; results will follow.

Regards,
Francesco

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
> Key: HADOOP-4586
> URL: https://issues.apache.org/jira/browse/HADOOP-4586
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.18.0, 0.18.1, 0.18.2
> Environment: High availability enterprise system
> Reporter: Francesco Salbaroli
> Assignee: Francesco Salbaroli
> Fix For: 0.21.0
>
> Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an effort to enhance performance,
> with a single JobTracker (master node). Its responsibilities range from
> managing the job submission process, computing the input splits, and
> scheduling the tasks on the slave nodes (TaskTrackers) to monitoring their health.
> In some environments, like IBM and Google's Internet-scale computing
> initiative, there is a need for high availability, and performance
> becomes a secondary issue. In these environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many different ways.
> In the document at:
> http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in
> my work.
> Thank you!
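
The master election described in the comment (the group coordinator is the cluster master, and a new one is chosen when it fails) can be illustrated with a small toy model. This is only a sketch, not code from the attached patch and not JGroups API: it assumes the coordinator is simply the first (oldest) member of the current group view, and the class name `ViewElectionSketch` and the node names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of view-based coordinator election.
 * Assumption: the coordinator (and therefore the master JobTracker)
 * is the first member of the view, i.e. the oldest surviving node.
 * This does not use JGroups; it only mimics the rule with a plain list.
 */
public class ViewElectionSketch {
    // Members in join order, oldest first (hypothetical node names).
    private final List<String> view = new ArrayList<>();

    /** A node joins the group; it is appended to the end of the view. */
    public void join(String node) {
        if (!view.contains(node)) {
            view.add(node);
        }
    }

    /** A node crashes or leaves; it is dropped from the view. */
    public void fail(String node) {
        view.remove(node);
    }

    /** Returns the current master, or null if the group is empty. */
    public String master() {
        return view.isEmpty() ? null : view.get(0);
    }

    public static void main(String[] args) {
        ViewElectionSketch cluster = new ViewElectionSketch();
        cluster.join("jobtracker-1");
        cluster.join("jobtracker-2");
        cluster.join("jobtracker-3");
        System.out.println("master: " + cluster.master()); // jobtracker-1

        // The master fails; the next oldest member takes over.
        cluster.fail("jobtracker-1");
        System.out.println("master: " + cluster.master()); // jobtracker-2
    }
}
```

In this model a failover needs no explicit vote: every surviving node that sees the same view derives the same master. The sketch deliberately ignores network partitions, where two sub-groups could each see a different view.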