Hi Chris,

You should really start all of the slave nodes to be sure that you don't lose access to any data. With a replication factor of r, each block lives on only r of the datanodes, so if you start fewer than (#nodes - #replication + 1) nodes you are virtually guaranteed that some blocks will have no live replicas at all. Starting 6 nodes out of 10 with the default replication factor of 3 leaves 4 nodes down, so the namenode never sees enough blocks reported and the filesystem stays in safe mode, as you've seen.
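To see what's going on, and to get out of safe mode by hand if you really do want to run the smaller cluster, something like the following should work on 0.19 (the "waiting for the available %" message is the namenode waiting for dfs.safemode.threshold.pct of the blocks to be reported by the datanodes):

  hadoop dfsadmin -report             # shows which datanodes have checked in
  hadoop fsck /                       # lists files with missing or under-replicated blocks
  hadoop dfsadmin -safemode get       # confirms the namenode is still in safe mode
  hadoop dfsadmin -safemode leave     # forces it out of safe mode

Note that forcing the namenode out of safe mode doesn't bring the missing blocks back; any file with a block only on the unstarted nodes will still fail to read until those nodes rejoin.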
BTW I've just created a Jira for EBS support (https://issues.apache.org/jira/browse/HADOOP-6108) which you might be interested in.

Cheers,
Tom

On Thu, Jun 25, 2009 at 3:51 PM, Chris Curtin <curtin.ch...@gmail.com> wrote:
> Hi,
>
> I am using 0.19.0 on EC2. The Hadoop execution and HDFS directories are on
> EBS volumes mounted to each node in my EC2 cluster. Only the Hadoop install
> is in the AMI. We have 10 EBS volumes, and when the cluster starts it
> randomly picks one for each slave. We don't always start all 10 slaves,
> depending on what type of work we are going to do.
>
> Every third or fourth start of the cluster, the namenode goes into safe mode
> and won't come out automatically. Restarting the datanodes and task trackers
> on each of the slaves doesn't help. There is not much in the log files besides
> the error about waiting for the available %. Forcing it out of safe mode
> allows the cluster to start working.
>
> My only thought is that something is being stored on one of the EBS volumes
> that isn't mounted when starting a smaller configuration (say 6 nodes instead
> of 10). But isn't HDFS fault tolerant, so that it carries on if there is a
> missing node?
>
> Any advice on why the namenode and datanodes can't find all the data blocks?
> Or where to look for more information about what might be going on?
>
> Thanks,
>
> Chris