I recently had to replace a node on a Hadoop 0.20.0 4-node cluster and I
can't quite explain what happened. If anyone has any insight I'd appreciate it.
When the node failed (drive failure), running the command 'hadoop fsck /'
correctly showed 3 data nodes instead of 4, and showed the under-replicated
blocks being re-replicated. I assume that once the node was determined to be
dead, the replicas on it no longer counted toward the replication factor, so
HDFS re-replicated to the remaining nodes to meet the configured replication
factor of 3. All is good. What I couldn't explain is that after re-building
and re-starting the failed node, I started the balancer (bin/start-balancer.sh)
and re-ran 'hadoop fsck /'. The report showed the 4th node back in the
cluster, but what struck me as strange is that a large number of blocks
(> 2k) were shown as under replicated. The under-replicated blocks were
eventually re-replicated and all the data seems correct.
Can someone explain why re-adding a node that had died would make the
replication factor of those blocks appear to drop from 3 to 2? Is there
something about the start-balancer.sh script that would make fsck report the
blocks as under replicated?
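For reference, the commands involved were roughly the following (a sketch;
exact paths assume the standard 0.20 install layout):

```shell
# Check HDFS health; fsck reports under-replicated block counts,
# and -blocks/-locations show per-block replica placement
hadoop fsck / -blocks -locations

# Summary of live/dead datanodes and per-node capacity/usage
hadoop dfsadmin -report

# Rebalance blocks across datanodes after re-adding the node
bin/start-balancer.sh
```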
Note that I'm still getting the process for replacing failed nodes down, so
it's possible that I was looking at things wrong for a bit.
Any insight would be greatly appreciated.
Thanks
Andy