Hi, I'm using Hadoop 0.19.0 on EC2. The Hadoop execution and HDFS directories are on EBS volumes mounted to each node in the EC2 cluster; only the Hadoop installation itself is in the AMI. We have 10 EBS volumes, and when the cluster starts it randomly assigns one to each slave. We don't always start all 10 slaves; it depends on what type of work we're going to do.
Every third or fourth time we start the cluster, the namenode goes into safe mode and won't come out automatically. Restarting the datanodes and task trackers on each of the slaves doesn't help. There isn't much in the log files besides the message about waiting for the available percentage of blocks. Forcing it out of safe mode lets the cluster start working.

My only theory is that something is being stored on one of the EBS volumes that isn't mounted when we start a smaller configuration (say 6 nodes instead of 10). But isn't HDFS supposed to be fault tolerant, so that it carries on when a node is missing?

Any advice on why the namenode and datanodes can't find all the data blocks? Or where to look for more information about what might be going on?

Thanks,
Chris
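P.S. For reference, the command I'm running to force it out of safe mode is `hadoop dfsadmin -safemode leave` (and `hadoop dfsadmin -safemode get` to check the state). As I understand it, the fraction of blocks that must be reported before the namenode leaves safe mode on its own is controlled by this property in hadoop-site.xml; the value shown is just the stock default, we haven't changed it:

```xml
<!-- Fraction of blocks that must be reported by datanodes before the
     namenode exits safe mode automatically. Default is 0.999. -->
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value>
</property>
```

If blocks really are stranded on unmounted EBS volumes, I'd guess that ratio never reaches the threshold, which would explain the hang.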