[ 
https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662054#action_12662054
 ] 

Suresh Srinivas commented on HADOOP-4586:
-----------------------------------------

Sorry for the late comments.
For a master/slave HA solution, the two main problems are:
1. A mechanism that determines the master in a cluster during startup and 
failover, including handling loss of quorum, split-brain, and fencing in case 
of split-brain. This also requires comprehensive management tools for 
configuring, managing, and monitoring the cluster.
2. Sharing state information between master and slave, so that a slave node can 
take over as master.
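
To make the first problem concrete, here is a minimal sketch of majority-quorum 
election combined with epoch-based fencing. It is not taken from the attached 
patch; the names (has_quorum, SharedStore, elect) and the "lowest node id wins" 
rule are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A partition may elect a master only if it sees a strict majority."""
    return reachable > cluster_size // 2


@dataclass
class SharedStore:
    """Hypothetical stand-in for the state shared by master and slave."""
    epoch: int = 0
    log: List[str] = field(default_factory=list)

    def write(self, writer_epoch: int, record: str) -> bool:
        # Fencing: a master holding a stale epoch is rejected, so an old
        # master that survives a split-brain cannot corrupt shared state.
        if writer_epoch < self.epoch:
            return False
        self.epoch = writer_epoch
        self.log.append(record)
        return True


def elect(partition: List[str], cluster_size: int,
          store: SharedStore) -> Optional[Tuple[str, int]]:
    """Return (master, epoch) for this partition, or None without quorum."""
    if not has_quorum(len(partition), cluster_size):
        return None
    # Bumping the epoch fences off any previously elected master.
    store.epoch += 1
    # Assumed tie-break rule: the lowest node id becomes master.
    return min(partition), store.epoch


store = SharedStore()
master, epoch = elect(["jt1", "jt2"], 3, store)          # majority of 3
store.write(epoch, "job-1 submitted")
minority = elect(["jt3"], 3, store)                      # no quorum
new_master, new_epoch = elect(["jt2", "jt3"], 3, store)  # failover
stale_ok = store.write(epoch, "late write from old master")
```

In this sketch the minority partition cannot elect a master, and the old 
master's late write is rejected after failover. A real design would also need 
failure detection and the management tooling mentioned above.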

Currently, the proposed solution mainly addresses the second problem. I have 
not seen much information on how the first problem is addressed. While state 
can be shared between master and slave in many ways, managing the master/slave 
cluster is the more complicated problem. Could you please add more information 
on how the design handles these issues, and some notes on how an administrator 
would use this functionality to manage the cluster?

Also, the impact of this feature on JobTracker performance needs to be 
analyzed.

> Has anyone explored refactoring the job tracker to use Zookeeper instead of 
> engineering a new master/slave replication system?
Storing the JobTracker state in ZooKeeper may not be a viable option, given 
that ZooKeeper is intended for storing small amounts of data (KB) and the 
JobTracker has a lot more data than that to persist.

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding 
> fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an effort to enhance performance,
> with a single JobTracker (master node). Its responsibilities range from
> managing the job submission process and computing the input splits to
> scheduling the tasks on the slave nodes (TaskTrackers) and monitoring their
> health.
> In some environments, like the IBM and Google Internet-scale computing
> initiative, there is a need for high availability, and performance
> becomes a secondary issue. In these environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many different ways. 
> In the document at: 
> http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in 
> my work.
> Thank you!

