Hey Andrew,

I don't know the answers to all your questions, but the https://issues.apache.org/jira/browse/MAPREDUCE-2288 JIRA serves as a good umbrella we can use to track this overall (there seem to have been multiple approaches presented over time).
The closest I found to your rumor note was https://issues.apache.org/jira/browse/MAPREDUCE-2648, but it lacks job state maintenance (i.e. provides no resuming of jobs post failover). I did not dig too deep, however.

On Sun, Jun 17, 2012 at 3:53 AM, Andrew Purtell <apurt...@apache.org> wrote:
> We are planning to run a next generation of Hadoop ecosystem components in
> our production in a few months. We plan to use HDFS 2.0 for the HA NameNode
> work. The platform will also include YARN but its use will be experimental.
> So we'll be running something equivalent to the CDH MR1 package to support
> production workloads for I'd guess a year.
>
> We have heard a rumor regarding the existence of a version of the MR1
> Jobtracker that persists state to Zookeeper such that failover to a new
> instance is fast and doesn't lose job state. I'd like to be aspirational and
> aim for a HA MR1 Jobtracker to complement the HA namenode. Even if no such
> existing code is available, we might adapt existing classes in the MR1
> Jobtracker to models/proxies of state in zookeeper. For clusters of our size
> (in the 100s of nodes range) this could be workable. Also, the MR client
> could possibly use ZK for failover like the HDFS client.
>
> I'm trying to find out first the availability of such code if anyone knows.
> Otherwise, we may try building this, and so also I'd like to get a sense of
> any interest in usage or dev collaboration.
>
> Best regards,
>
>    - Andy
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

--
Harsh J