On Tuesday 29 July 2008 18:22:07 Paco NATHAN wrote:
> Jason,
>
> FWIW -- based on a daily batch process requiring 9 Hadoop jobs in
> sequence -- 100+2 EC2 nodes, 2 TB of data, 6 hrs run time.
>
> We tend to see a namenode failing early, e.g. the "problem advancing"
> exception in the values iterator, particularly during a reduce phase.
>
> Hot-fail would be great. Otherwise, given the duration of our batch
> job overall, we do what you describe: shut down the cluster, etc.
>
> We would prefer to observe this kind of failure sooner rather than
> later. We've discussed internally how to craft an initial job that
> could stress-test the namenode -- think of it as a "unit test" for
> the cluster.
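A throwaway job that just hammers the namenode with metadata operations
might be enough to surface that kind of failure before the real batch
starts. Very rough sketch -- the /tmp/nn-smoke path, the 1000-file count,
and running it from a box that already has the hadoop client configured
for the cluster are all assumptions for illustration:

    #!/bin/sh
    # Cheap namenode smoke test: create, list, and delete a pile of files.
    # Each "hadoop fs" call pays JVM startup, so this is slow but simple.
    DIR=/tmp/nn-smoke.$$
    hadoop fs -mkdir $DIR || exit 1
    i=0
    while [ $i -lt 1000 ]; do
        hadoop fs -touchz $DIR/f$i || { echo "namenode failed at file $i"; exit 1; }
        i=$((i + 1))
    done
    hadoop fs -ls $DIR > /dev/null || exit 1
    hadoop fs -rmr $DIR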
ssh namenode 'kill -9 $(ps ax | grep "[j]ava.*NameNode" | awk "{print \$1}")'

There's your namenode failure, if you just want to do the failover
exercise ;)

Andreas

> The business case for this becomes especially important when you need
> to automate the Hadoop cluster launch, e.g. with RightScale or another
> "cloud enabler" service.
>
> Anybody else heading in this direction?
>
> Paco
>
> On Tue, Jul 29, 2008 at 11:01 AM, Jason Venner <[EMAIL PROTECTED]> wrote:
> > What are people doing?
> >
> > For jobs that have a long enough SLA, just shutting down the cluster
> > and bringing up the secondary as the master works for us.
> > We have some jobs where that doesn't work well, because the recovery
> > time is not acceptable.
> >
> > There has been internal discussion of using drbd to hot-fail a
> > namenode to a backup so that the running job can continue.
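FWIW, the drbd route basically comes down to keeping dfs.name.dir on a
drbd-backed volume and promoting the standby box when the primary dies.
The promote side might look roughly like the following -- the resource
name r0, the /hadoop/name mount point, and the install path are made up
for illustration, and it assumes the drbd pair and the Hadoop install on
the standby are already in place:

    #!/bin/sh
    # Run on the standby box once the primary namenode is confirmed dead.
    drbdadm primary r0                 # take over the replicated volume
    mount /dev/drbd0 /hadoop/name      # dfs.name.dir lives on this device
    /usr/local/hadoop/bin/hadoop-daemon.sh start namenode

The part that actually keeps a running job alive is getting clients and
datanodes to find the new box -- a virtual IP or a DNS swap for the
namenode hostname -- and that is the piece this sketch does not cover.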
