Hi Chris,

You should really start all the slave nodes to be sure that you don't
lose access to your data. If you start fewer than #nodes - #replication + 1
nodes then some blocks are virtually guaranteed to be missing, because all
of their replicas can end up on datanodes that aren't running. Starting 6
nodes out of 10 will therefore leave the filesystem stuck in safe mode, as
you've seen.
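
To spell out the arithmetic (assuming the default replication factor of 3,
i.e. dfs.replication=3, which you haven't stated): with a 10-node cluster you
need at least 10 - 3 + 1 = 8 datanodes running to be sure every block has at
least one live replica. With only 6 up, any block whose three replicas all
sit on the four missing nodes goes unreported, so the namenode never reaches
its safe mode threshold (dfs.safemode.threshold.pct). You can check the
datanode and block state, and leave safe mode by hand if you know the
missing volumes will come back:

  hadoop dfsadmin -report
  hadoop dfsadmin -safemode get
  hadoop dfsadmin -safemode leave

Note that leaving safe mode manually only stops the namenode waiting; blocks
that live solely on the unstarted nodes stay unavailable until those nodes
join the cluster.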

BTW I've just created a Jira for EBS support
(https://issues.apache.org/jira/browse/HADOOP-6108) which you might be
interested in.

Cheers,
Tom

On Thu, Jun 25, 2009 at 3:51 PM, Chris Curtin<curtin.ch...@gmail.com> wrote:
> Hi,
>
> I am using 0.19.0 on EC2. The Hadoop execution and HDFS directories are on
> EBS volumes mounted to each node in my EC2 cluster. Only the install of
> hadoop is in the AMI. We have 10 EBS volumes and when the cluster starts it
> randomly picks one for each slave. We don't always start all 10 slaves
> depending on what type of work we are going to do.
>
> Every third or fourth start of the cluster the namenode goes into safemode
> and won't come out automatically. Restarting datanodes and task trackers on
> each of the slaves doesn't help. Not much in the log files besides the error
> about waiting for the available %. Forcing it out of safe mode allows the
> cluster to start working.
>
> My only thought is that something is being stored on one of the EBS volumes
> that isn't mounted when we start a smaller configuration (say 6 nodes instead
> of 10). But isn't HDFS fault tolerant, so that if a node is missing it just
> carries on?
>
> Any advice on why the namenode and datanodes can't find all the data blocks?
> Or where to look for more information about what might be going on?
>
> Thanks,
>
> Chris
>
