Andreas Kostyrka wrote:
On Tuesday 29 July 2008 18:22:07 Paco NATHAN wrote:
Jason,
FWIW -- based on a daily batch process requiring 9 Hadoop jobs in
sequence -- 100+2 EC2 nodes, 2 TB of data, 6 hrs run time.
We tend to see the namenode fail early, e.g., the "problem advancing"
exception in the values iterator, particularly during a reduce phase.
Hot failover would be great. Otherwise, given the overall duration of
our batch job, we do what you describe: shut down the cluster, etc.
Would prefer to observe this kind of failure sooner rather than later.
We've discussed internally how to craft an initial job that could
stress-test the namenode -- think of it as a "unit test" for the cluster.
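
A minimal sketch of such a smoke test, assuming a shell with the hadoop
client on the PATH (the /tmp/nn-stress path and the loop count below are
made up), would hammer the namenode with pure metadata operations before
the real jobs start:

hadoop fs -mkdir /tmp/nn-stress
i=0
while [ $i -lt 1000 ]; do
  # touchz creates an empty file: namenode-only work, no datanode block I/O
  hadoop fs -touchz /tmp/nn-stress/file-$i
  i=$((i + 1))
done
hadoop fs -ls /tmp/nn-stress > /dev/null   # read the metadata back
hadoop fs -rmr /tmp/nn-stress              # clean up; exercises the delete path too

If the namenode is going to fall over, better that it does so on
throwaway files than an hour into the real run.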
ssh namenode 'kill -9 $(ps ax | awk "/java.*[N]ameNode/ {print \$1}")'
There's your namenode failure, if you just want to exercise a
failover ;)
Simulating a network partition can be more interesting, as your
failover tools then have to deal with the risk that two machines both
think they are in charge. This is why building high-availability,
fault-tolerant systems is tricky.
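
A crude way to fake a partition on one box, assuming root and iptables
(the 10.0.0.5 address below is just a stand-in for the peer you want to
cut off):

iptables -A INPUT  -s 10.0.0.5 -j DROP   # drop everything arriving from the peer
iptables -A OUTPUT -d 10.0.0.5 -j DROP   # and everything we try to send to it
# ...watch what the failover logic does while each side thinks the other is dead...
iptables -D INPUT  -s 10.0.0.5 -j DROP   # remove the rules to heal the partition
iptables -D OUTPUT -d 10.0.0.5 -j DROP

Both halves keep running, which is exactly what makes this nastier than
a straight kill -9.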
--
Steve Loughran http://www.1060.org/blogxter/publish/5
Author: Ant in Action http://antbook.org/