> I just want to be more clear in what I meant by "HA JobTracker for
> parity with HDFS". There should be no need to quiesce the JT with a
> highly available NameNode, and restarting jobs from the beginning if
> the JT crashes isn't good enough to meet the user expectations implied
> by "high availability", at least those who are our internal customers.

Hi Andrew.

A couple of points...

1) Quiescing the JT can be somewhat refined, but the focus there is to have 
reasonable behavior if the storage layer becomes unavailable or has not yet 
been started in a boot sequence.  This is useful functionality that simply 
addresses a different set of failure cases. 

2) I agree that restarting jobs from scratch is not desirable.  This is an 
independent issue we've been working on in YARN.  The key here is simply 
sorting out how you manage state efficiently on ZK or HDFS.  The good news is 
HBase demonstrates how this can be done (see its region server and master 
designs).
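To sketch what I mean by managing state on durable storage (illustrative Python only, not actual JT code; the class and file names are invented), the idea is just: checkpoint task status as it changes, and on restart reschedule only what wasn't finished:

```python
import json
import os
import tempfile

class JobStateStore:
    """Checkpoints per-job task status to durable storage
    (a local file stands in for HDFS/ZK in this sketch)."""
    def __init__(self, path):
        self.path = path

    def save(self, job_state):
        # Write-then-rename so a crash mid-write never leaves a torn checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(job_state, f)
        os.replace(tmp, self.path)

    def load(self):
        if not os.path.exists(self.path):
            return {}  # fresh start: no prior state
        with open(self.path) as f:
            return json.load(f)

# The running JT records task completions as they happen.
store = JobStateStore(os.path.join(tempfile.mkdtemp(), "jt-state.json"))
store.save({"job_1": {"map_0": "DONE", "map_1": "RUNNING"}})

# A restarted JT reloads state and reruns only unfinished tasks.
recovered = store.load()
to_rerun = [t for t, s in recovered["job_1"].items() if s != "DONE"]
print(to_rerun)  # only map_1 needs to be rescheduled
```

That's the whole trick HBase relies on: the authoritative state lives in the storage layer, so the process itself is disposable.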

> I meant hot JT failover, that there is a primary and backup JT, that
> they share state sufficient for the backup to take over immediately if
> the primary fails, and that the TTs and JobClients both will switch
> seamlessly to the backup should their communications with the primary
> fail.


I think state sharing is very expensive and error prone.  These kinds of hot-hot 
solutions are almost an anti-pattern, IMO.  In the case of HDFS we are halfway 
through implementing this, so we don't need to reopen that.  One can argue that 
HBase and HDFS might need them, given the desire for MANY very low-latency 
requests.  But I'd observe that HBase hasn't opted for this complexity yet, and 
I'm more tempted to emulate its designs than HDFS's for MR.

For MR, a good, simple cold failover design should be MUCH easier to implement, 
debug, and maintain.  Running jobs need not be lost (their state can be 
stored in durable storage or recovered from the cluster), and the time to detect 
failure should end up dominating the time to recover, much like what we are 
seeing in HDFS testing.  So for small clusters there should be zero reason to 
do hot-hot.
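The cold-failover shape I have in mind is roughly this (a toy Python sketch with invented names, using an explicit clock instead of real heartbeats): the backup does nothing until the primary's heartbeat lease expires, and only then promotes itself and reloads state. Detection time is the lease; recovery is just a state reload.

```python
class ColdFailoverMonitor:
    """Backup JT takes over only after the primary's heartbeat lease expires."""
    def __init__(self, lease_secs):
        self.lease_secs = lease_secs
        self.last_heartbeat = None

    def heartbeat(self, now):
        # Called whenever the primary checks in.
        self.last_heartbeat = now

    def should_take_over(self, now):
        # Failure detection is bounded by the lease length; the recovery
        # step (reloading checkpointed state) is comparatively cheap.
        if self.last_heartbeat is None:
            return False
        return (now - self.last_heartbeat) > self.lease_secs

backup = ColdFailoverMonitor(lease_secs=10)
backup.heartbeat(now=100)
print(backup.should_take_over(now=105))  # False: primary still within its lease
print(backup.should_take_over(now=115))  # True: lease expired, backup promotes itself
```

In practice you'd hang this off a ZK ephemeral node rather than hand-rolled timestamps, but the failure-time arithmetic is the same: time-to-detect dominates.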

I think we are much better off focusing on simple design patterns that use the 
storage systems we have (ZK and HDFS) to restore state quickly on failover.  
The HBase region server and master are good examples of design in this area 
that we should emulate, IMO.  MR has much simpler problems, and any investment 
we make in improving WALs and state management on HDFS will make HBase and 
every new compute model ported to YARN better.
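For concreteness, the WAL pattern I'm pointing at is tiny (again a hedged Python sketch, not the HBase implementation; names are mine): log every state change durably before applying it, and on failover replay the log to rebuild in-memory state instead of restarting jobs.

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Append-only log: record each state change before applying it,
    then replay the log on failover to rebuild in-memory state."""
    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the record durable before acking

    def replay(self):
        # Rebuild state by applying log records in order.
        state = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    rec = json.loads(line)
                    state[rec["task"]] = rec["status"]
        return state

wal = WriteAheadLog(os.path.join(tempfile.mkdtemp(), "jt-wal.log"))
wal.append({"task": "map_0", "status": "DONE"})
wal.append({"task": "map_1", "status": "DONE"})

# A newly promoted JT replays the log rather than rerunning finished work.
print(wal.replay())  # {'map_0': 'DONE', 'map_1': 'DONE'}
```

Anything we do to make appends and replays like this fast on HDFS pays off for every framework that adopts the pattern.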


On Sep 9, 2012, at 10:57 AM, Andrew Purtell <[email protected]> wrote:

> Hi Arun,
> 
> On Mon, Sep 3, 2012 at 4:02 AM Arun C Murthy wrote:
>>> On Sep 1, 2012, at 6:32 AM, Andrew Purtell wrote:
>>> I'd imagine such a MR(v1) in Hadoop, if this happened, would concentrate on
>>> performance improvements, maybe such things as alternate shuffle plugins.
>>> Perhaps a HA JobTracker for parity with HDFS.
>> 
>> Lots of this has already happened in branch-1, please look at:
>> # JT Availability: MAPREDUCE-3837, MAPREDUCE-4328, MAPREDUCE-4603 (WIP)
> 
> Thanks for the pointers!
> 
> I just want to be more clear in what I meant by "HA JobTracker for
> parity with HDFS". There should be no need to quiesce the JT with a
> highly available NameNode, and restarting jobs from the beginning if
> the JT crashes isn't good enough to meet the user expectations implied
> by "high availability", at least those who are our internal customers.
> I meant hot JT failover, that there is a primary and backup JT, that
> they share state sufficient for the backup to take over immediately if
> the primary fails, and that the TTs and JobClients both will switch
> seamlessly to the backup should their communications with the primary
> fail. I'd expect state sharing to limit scalability to the small- and
> medium-cluster range, and that's fine, YARN is the answer for
> scalability issues in the large and largest clusters already.
> 
>> # Performance - backports of PureJavaCrc32 in spills (MAPREDUCE-782), 
>> fadvise backports (MAPREDUCE-3289) and several other misc. fixes.
> 
> -- 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
