On Tuesday 29 July 2008 18:22:07 Paco NATHAN wrote:
> Jason,
>
> FWIW -- based on a daily batch process requiring 9 Hadoop jobs in
> sequence -- 100+2 EC2 nodes, 2 TB of data, 6 hrs run time.
>
> We tend to see a namenode failing early, e.g. the "problem advancing"
> exception in the values iterator, particularly during a reduce phase.
>
> Hot-fail would be great. Otherwise, given the duration of our batch
> job overall, we do what you describe: shut down the cluster, etc.
>
> We would prefer to observe this kind of failure sooner rather than
> later. We've discussed internally how to craft an initial job that
> could stress-test the namenode -- think of it as a "unit test" for
> the cluster.
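A throwaway job that just hammers the namenode with metadata operations
might be enough to surface that kind of failure before the real batch
starts. Very rough sketch -- the /tmp/nn-smoke path, the 1000-file count,
and running it from a box that already has the hadoop client configured
for the cluster are all assumptions for illustration:

    #!/bin/sh
    # Cheap namenode smoke test: create, list, and delete a pile of files.
    # Each "hadoop fs" call pays JVM startup, so this is slow but simple.
    DIR=/tmp/nn-smoke.$$
    hadoop fs -mkdir $DIR || exit 1
    i=0
    while [ $i -lt 1000 ]; do
        hadoop fs -touchz $DIR/f$i || { echo "namenode failed at file $i"; exit 1; }
        i=$((i + 1))
    done
    hadoop fs -ls $DIR > /dev/null || exit 1
    hadoop fs -rmr $DIR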
ssh namenode 'kill -9 $(ps ax | grep "[j]ava.*NameNode" | awk "{print \$1}")'

There's your namenode failure, if you just want to do the failover
exercise ;)

Andreas

> The business case for this becomes especially important when you need
> to automate the Hadoop cluster launch, e.g. with RightScale or another
> "cloud enabler" service.
>
> Anybody else heading in this direction?
>
> Paco
>
> On Tue, Jul 29, 2008 at 11:01 AM, Jason Venner <[EMAIL PROTECTED]> wrote:
> > What are people doing?
> >
> > For jobs that have a long enough SLA, just shutting down the cluster
> > and bringing up the secondary as the master works for us.
> > We have some jobs where that doesn't work well, because the recovery
> > time is not acceptable.
> >
> > There has been internal discussion of using drbd to hot-fail a
> > namenode to a backup so that the running job can continue.
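FWIW, the drbd route basically comes down to keeping dfs.name.dir on a
drbd-backed volume and promoting the standby box when the primary dies.
The promote side might look roughly like the following -- the resource
name r0, the /hadoop/name mount point, and the install path are made up
for illustration, and it assumes the drbd pair and the Hadoop install on
the standby are already in place:

    #!/bin/sh
    # Run on the standby box once the primary namenode is confirmed dead.
    drbdadm primary r0                 # take over the replicated volume
    mount /dev/drbd0 /hadoop/name      # dfs.name.dir lives on this device
    /usr/local/hadoop/bin/hadoop-daemon.sh start namenode

The part that actually keeps a running job alive is getting clients and
datanodes to find the new box -- a virtual IP or a DNS swap for the
namenode hostname -- and that is the piece this sketch does not cover.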
