Re: hdfs question when replacing dead node...

Aaron Kimball Thu, 23 Jul 2009 19:22:05 -0700

How soon after re-joining the node did you re-run fsck? fsck returns data
based on the latest block reports from datanodes -- these are scheduled to
run (I think) every 15 minutes, so the NameNode's view of block replication
may be as much as 15 minutes out of date.
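
If the numbers are just stale, re-running fsck over a longer window should
show the under-replicated count dropping to zero. A minimal sketch of that
check, assuming the fsck summary contains a line starting with
"Under-replicated blocks:" followed by a count, and that `hadoop` is on the
PATH. The function names, polling interval, and poll limit are made up for
illustration; the actual block report interval is whatever your cluster
config (dfs.blockreport.intervalMsec) says, so size the window accordingly.

```python
import re
import subprocess
import time

# Assumed format of the fsck summary line; adjust if your version differs.
_UNDER_REPLICATED = re.compile(r"Under-replicated blocks:\s*(\d+)")


def parse_under_replicated(fsck_output):
    """Return the under-replicated block count from fsck text, or None."""
    match = _UNDER_REPLICATED.search(fsck_output)
    return int(match.group(1)) if match else None


def wait_until_replicated(path="/", poll_seconds=60, max_polls=30):
    """Re-run 'hadoop fsck <path>' until no blocks are under-replicated.

    Returns True once the count reaches zero, False if max_polls is
    exhausted first. poll_seconds and max_polls are illustrative defaults.
    """
    for _ in range(max_polls):
        result = subprocess.run(
            ["hadoop", "fsck", path],
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        count = parse_under_replicated(result.stdout)
        print("under-replicated blocks:", count)
        if count == 0:
            return True
        time.sleep(poll_seconds)
    return False
```

Calling wait_until_replicated() right after the node re-joins would show
whether the count settles on its own once fresh block reports arrive.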


- Aaron

On Thu, Jul 23, 2009 at 3:00 PM, Andy Sautins
<[email protected]> wrote:

>
>   I recently had to replace a node on a hadoop 0.20.0 4-node cluster and I
> can't quite explain what happened.  If anyone has any insight I'd appreciate
> it.
>
>   When the node failed ( drive failure ) running the command 'hadoop fsck
> /' correctly showed the data nodes to now be 3 instead of 4 and showed the
> under-replicated blocks being re-replicated.  I assume that once the node was
> determined to be dead the blocks on the dead node were not considered in the
> replication factor and caused hdfs to replicate to the available nodes to
> meet the configured replication factor of 3.  All is good.  What I couldn't
> explain is that after re-building and re-starting the failed node I started
> the balancer ( bin/start-balancer.sh ) and re-ran 'hadoop fsck /'.  The
> number of nodes showed that the 4th node was now back in the cluster.  What
> struck me as strange is a large number of blocks ( > 2k ) were shown as
> under replicated.  The under replicated blocks were eventually re-replicated
> and all the data seems correct.
>
>   Can someone explain why, after re-adding a node that had died, the
> replication factor would go from 3 to 2?  Is there something in the
> start-balancer.sh script that would make fsck report the blocks as
> under-replicated?
>
>   Note that I'm still getting the process for replacing failed nodes down,
> so it's possible that I was looking at things wrong for a bit.
>
>    Any insight would be greatly appreciated.
>
>    Thanks
>
>    Andy
>
