[ 
https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662309#action_12662309
 ] 

Francesco Salbaroli commented on HADOOP-4586:
---------------------------------------------

>Sorry for the late comments:
>For a master/slave HA solution, two main problems are:
>1. Mechanism that determines a master in a cluster during startup and failover.

The JGroups library (manual: 
[http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf] ) 
automatically handles the election of a group coordinator. The node elected as 
group coordinator is also the master of the cluster. If it fails, a new group 
coordinator (and, consequently, a new cluster master) is elected.
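To illustrate the election rule, here is a minimal sketch in plain Java. It does not use any JGroups classes: it only simulates the convention that the coordinator is the first member of the current view, and that the next member takes over when the coordinator leaves. The class name, method names and node names (jt-a, jt-b, jt-c) are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation only: no JGroups classes are used here.
// A "view" is modelled as an ordered list of member names, oldest first.
class CoordinatorElection {

    /** The coordinator (and so the cluster master) is the first member of the view. */
    static String coordinator(List<String> view) {
        if (view.isEmpty()) {
            throw new IllegalStateException("empty view: no coordinator");
        }
        return view.get(0);
    }

    /** Returns the view left after one member fails; the order of the others is kept. */
    static List<String> viewAfterFailure(List<String> view, String failed) {
        List<String> next = new ArrayList<>(view);
        next.remove(failed);
        return next;
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("jt-a", "jt-b", "jt-c");
        System.out.println("master: " + coordinator(view));

        // The current master fails; the next member in the view takes over.
        List<String> next = viewAfterFailure(view, coordinator(view));
        System.out.println("master after failover: " + coordinator(next));
    }
}
```

In the real system the view would be delivered by JGroups on every membership change, and each JobTracker would compare its own address with the first member of that view.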

>Handling loss of quorum, 

The shared state resides entirely on HDFS (see issues HADOOP-1876 and 
HADOOP-3245), so at present there is no shared soft-state between nodes. 
However, the facilities for managing shared state are present and can be used 
in a future update.

>split-brain and fencing in case of split-brain.

The JGroups library tries to automatically handle network partitions and 
merging, but given that:

* There is no shared soft-state
* There is only one access point in the whole Hadoop cluster to the HDFS (the 
  NameNode)

the network partition problem should not be an issue (only one partition at a 
time can access the HDFS). In future versions a more elegant way of dealing 
with network partitions should be added.
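The reasoning above can be sketched as a simple gate, again in plain Java with no JGroups or Hadoop classes. It assumes, as stated above, that at most one partition can reach the NameNode; the reachability check is passed in as a hypothetical predicate, and all names are illustrative rather than part of the patch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Simulation only: shows why two partitions do not yield two active masters
// when only one of them can reach the NameNode.
class PartitionGate {

    /**
     * A node may act as master only if it is the coordinator of its own
     * partition (first member of that partition's view) and it can reach
     * the NameNode.
     */
    static boolean mayActAsMaster(String self,
                                  List<String> partitionView,
                                  Predicate<String> canReachNameNode) {
        boolean isCoordinator =
                !partitionView.isEmpty() && partitionView.get(0).equals(self);
        return isCoordinator && canReachNameNode.test(self);
    }

    public static void main(String[] args) {
        // The cluster splits into two partitions, each electing a coordinator.
        List<String> left = Arrays.asList("jt-a", "jt-b");
        List<String> right = Arrays.asList("jt-c");

        // Assumption: only nodes in the left partition still reach the NameNode.
        Predicate<String> reachable = left::contains;

        System.out.println("jt-a active: " + mayActAsMaster("jt-a", left, reachable));
        System.out.println("jt-c active: " + mayActAsMaster("jt-c", right, reachable));
    }
}
```

This gate does no fencing: if both partitions could reach the NameNode, both coordinators would pass it, which is the case a future version would have to handle.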

> It also requires comprehensive management tools for configuring, managing 
> and monitoring cluster.

I am now adding JMX support. After the initial testing phase I will post the 
results and an updated version.

>2. Sharing state information between master and slave, so that a slave node 
>can take over as master.
>Currently the proposed solution addresses mainly the second problem. I have not 
>seen much information on how the first problem is addressed. While the sharing 
>of information between master and slave can be done in many ways, managing 
>the master/slave cluster is a more complicated problem. Could you please add 
>more information on how the design handles these issues and some notes on how 
>administrator uses this functionality to manage the cluster.

I hope this answers your questions. If you need more detail, feel free to 
contact me.

>Also analysis of the impact of job tracker performance due to the introduction 
>of this feature needs to be done.

I am about to begin the testing phase; results will follow.

Regards,
Francesco

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding 
> fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an effort to enhance performance,
> with a single JobTracker (master node). Its responsibilities range from
> managing the job submission process and computing the input splits to
> scheduling tasks on the slave nodes (TaskTrackers) and monitoring their health.
> In some environments, like the IBM and Google's Internet-scale computing
> initiative, there is a need for high availability, and performance
> becomes a secondary issue. In these environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many different ways. 
> In the document at: 
> http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in 
> my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
