Andreas Kostyrka wrote:
On Tuesday 29 July 2008 18:22:07 Paco NATHAN wrote:
Jason,
FWIW -- based on a daily batch process requiring 9 Hadoop jobs in
sequence -- 100+2 EC2 nodes, 2 TB of data, 6 hrs run time.
We tend to see the namenode fail early, e.g., the "problem advancing"
exception in the values iterator, particularly during a reduce phase.
Hot failover would be great. Otherwise, given the overall duration of
our batch job, we do what you describe: shut down the cluster, etc.
Would prefer to observe this kind of failure sooner rather than later.
We've discussed internally how to craft an initial job that could
stress-test the namenode -- think of it as a "unit test" for the cluster.
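
A minimal sketch of such a smoke test, assuming a shell with the hadoop
client on the PATH (the /tmp/nn-stress path and the loop count below are
made up), would hammer the namenode with pure metadata operations before
the real jobs start:

hadoop fs -mkdir /tmp/nn-stress
i=0
while [ $i -lt 1000 ]; do
  # touchz creates an empty file: namenode-only work, no datanode block I/O
  hadoop fs -touchz /tmp/nn-stress/file-$i
  i=$((i + 1))
done
hadoop fs -ls /tmp/nn-stress > /dev/null   # read the metadata back
hadoop fs -rmr /tmp/nn-stress              # clean up; exercises the delete path too

If the namenode is going to fall over, better that it does so on
throwaway files than an hour into the real run.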
ssh namenode 'kill -9 $(ps ax | awk "/java.*[N]ameNode/ {print \$1}")'
There's your namenode failure, if you just want to exercise a
failover ;)
Simulating a network partition can be more interesting, as your
failover tools then have to deal with the risk that two machines both
think they are in charge. This is why building high-availability,
fault-tolerant systems is tricky.
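
A crude way to fake a partition on one box, assuming root and iptables
(the 10.0.0.5 address below is just a stand-in for the peer you want to
cut off):

iptables -A INPUT  -s 10.0.0.5 -j DROP   # drop everything arriving from the peer
iptables -A OUTPUT -d 10.0.0.5 -j DROP   # and everything we try to send to it
# ...watch what the failover logic does while each side thinks the other is dead...
iptables -D INPUT  -s 10.0.0.5 -j DROP   # remove the rules to heal the partition
iptables -D OUTPUT -d 10.0.0.5 -j DROP

Both halves keep running, which is exactly what makes this nastier than
a straight kill -9.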
--
Steve Loughran http://www.1060.org/blogxter/publish/5
Author: Ant in Action http://antbook.org/