On Thu, Mar 3, 2011 at 6:44 PM, Mark Kerzner <[email protected]> wrote:
> Hi,
>
> in my small development cluster I have a master/slave node and a slave node,
> and I shut down the slave node at night. I often see that my HDFS is
> corrupted, and I have to reformat the name node and delete the data
> directory.

Why do you shut down the slave at night? HDFS should only be corrupted if you're missing all copies of a block. With a replication factor of 3 (the default) you should have 100% of the data on both nodes (if you only have 2 nodes). If you've dialed it down to 1, simply starting the slave back up should "un-corrupt" HDFS. You definitely don't want to be doing this to HDFS regularly (dropping nodes from the cluster and re-adding them) unless you're trying to test HDFS' failure semantics.

> It finally dawns on me that with such a small cluster I'd better shut the
> daemons down, for otherwise they are trying too hard to compensate for the
> missing node and eventually it goes bad. Is my understanding correct?

It doesn't "eventually go bad." If the NN sees a DN disappear, it may start re-replicating data to another node. In such a small cluster there may be nowhere else to get the blocks from, but I bet you dialed the replication factor down to 1 (or have code that writes files with a rep factor of 1, like teragen / terasort).

In short, if you're going to shut down nodes like this, put the NN into safe mode so it doesn't freak out (which will also make the cluster unusable during that time), but there's definitely no need to be reformatting HDFS. Just re-introduce the DN you shut down to the cluster.

> Thank you,
> Mark

--
Eric Sammer
twitter: esammer
data: www.cloudera.com
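P.S. In case it helps, the workflow above looks roughly like this from the command line (just a sketch; the path /user/mark/data and the replication value of 2 below are placeholder examples):

    # put the NN into safe mode before taking the DN down
    hadoop dfsadmin -safemode enter

    # ... stop the DN, do your maintenance, then start it again ...

    # confirm the DN has re-registered with the NN
    hadoop dfsadmin -report

    # leave safe mode once the node is back
    hadoop dfsadmin -safemode leave

    # check block health instead of reformatting; missing or corrupt blocks show up here
    hadoop fsck /

    # if a file was written with a rep factor of 1, bump it back up
    hadoop fs -setrep -w 2 /user/mark/data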
