[ 
https://issues.apache.org/jira/browse/MAPREDUCE-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983639#action_12983639
 ] 

Hari A V commented on MAPREDUCE-225:
------------------------------------

Hi,

My team has also been analysing how to provide HA for the JobTracker. 
Our approach is quite similar to Francesco's. 

The complete HA solution can be divided into three aspects:

1. Sharing of job-related state between Master and Slave JobTrackers

        This can be achieved with issues HADOOP-1876 and HADOOP-3245. 

2. Failure Detection and Master Election
        
        We prefer ZooKeeper for this. We had quite a bad experience with 
JGroups in some of our previous projects, including deadlocks and network 
traffic overhead (the latest version of JGroups may be more stable), and we 
were forced to replace it. In our view, ZooKeeper is the best solution 
available for leader election. We have seen it used well in similar situations 
in the "Katta" project and in some of our internal projects.

3. Notifying JobClients and TaskTrackers about the new Master after a failure

        One option would be DNS, as mentioned. 
        Another option is to provide a list of JobTracker IPs to JobClients and 
TaskTrackers. They can silently retry on all available IPs in case of failure. 
On the server side, slave JobTrackers will not accept any service requests. 
This way we can avoid split-brain and network partition scenarios. A ZooKeeper 
cluster inherently avoids split-brain issues in leader election.
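
To make the second option in point 3 concrete, here is a minimal Python sketch 
of the client-side retry. It is only an illustration under stated assumptions: 
JobTrackerStub, NotMasterError and submit_with_failover are hypothetical names, 
not Hadoop or ZooKeeper APIs, the IPs are made up, and the election result 
(which ZooKeeper would decide) is simply set by hand.

```python
# Illustrative sketch only: every name below is hypothetical and is not
# part of Hadoop or ZooKeeper.


class NotMasterError(Exception):
    """Raised by a slave JobTracker that refuses to serve requests."""


class JobTrackerStub:
    """In-memory stand-in for a JobTracker endpoint."""

    def __init__(self, address, is_master=False, alive=True):
        self.address = address
        self.is_master = is_master
        self.alive = alive

    def submit(self, job):
        if not self.alive:
            raise ConnectionError(f"{self.address} is unreachable")
        if not self.is_master:
            # Slaves reject all service requests, so a client is never
            # served by two trackers at the same time.
            raise NotMasterError(f"{self.address} is not the master")
        return f"{job} accepted by {self.address}"


def submit_with_failover(trackers, job):
    """Try each configured address in order until the master answers."""
    for tracker in trackers:
        try:
            return tracker.submit(job)
        except (ConnectionError, NotMasterError):
            continue  # silently move on to the next address
    raise RuntimeError("no active JobTracker found")


# Assumed configuration: a fixed list of JobTracker addresses.
trackers = [
    JobTrackerStub("10.0.0.1", is_master=True),
    JobTrackerStub("10.0.0.2"),
    JobTrackerStub("10.0.0.3"),
]
first = submit_with_failover(trackers, "job-1")

# Simulate a master failure. The promotion of 10.0.0.2 is set by hand here;
# in the proposal it would be the outcome of the ZooKeeper election.
trackers[0].alive = False
trackers[1].is_master = True
second = submit_with_failover(trackers, "job-2")
```

The client needs no notification of the new Master here: it only needs the 
full address list, and the slaves' refusal to serve keeps it from talking to 
a non-master by mistake.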

We have not started work on this yet. Any feedback would be welcome. 

thanks
Hari


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: MAPREDUCE-225
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-225
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>         Attachments: Enhancing the Hadoop MapReduce framework by adding 
> fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, 
> HADOOP-4586v0.3.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an effort to enhance performance,
> with a single JobTracker (master node). Its responsibilities range from
> managing the job submission process and computing the input splits to
> scheduling tasks on the slave nodes (TaskTrackers) and monitoring their health.
> In some environments, like IBM and Google's Internet-scale computing
> initiative, there is a need for high availability, and performance
> becomes a secondary issue. In these environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many different ways. 
> In the document at: 
> http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in 
> my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
