On Thu, Jul 30, 2009 at 12:38 AM, bhushan_mahale <[email protected]> wrote:
> Thanks Todd for the reply.
>
> A few more queries:
> If the dead node comes up (with a different IP and RSA key than before),
> can we use the data it has?

Yes - the storage directory holds a unique ID in dfs.data.dir/current/VERSION,
so changing its IP address should be fine.

> If we just make the password-less login work, would the data on that
> node be usable without the need for formatting the namenode?

Not sure what exactly you mean by this. There's no need to reformat the
namenode if datanodes die. As for passwordless login, that is a convenience
used *only* by the start-*.sh scripts. Hadoop itself does not rely on SSH
in any way.

-Todd

-----Original Message-----
> From: Todd Lipcon [mailto:[email protected]]
> Sent: Wednesday, July 29, 2009 11:58 PM
> To: [email protected]
> Subject: Re: To retrieve data on dead node
>
> On Wed, Jul 29, 2009 at 8:51 AM, bhushan_mahale <
> [email protected]> wrote:
>
> > Hi,
> >
> > What are the possible ways to retrieve the data if a node goes down in
> > a Hadoop cluster?
> >
> > Assuming a replication factor of 3, and 3 nodes go down in a 10-node
> > cluster, how do we retrieve the data?
>
> Hi Bhushan,
>
> If 3 nodes go down at the same time, some of your data will become
> inaccessible. If you cannot recover at least one of those nodes, you will
> have no way to recover the data. If you can recover at least one, then
> the blocks will become available at replication count 1. The NN will
> notice the underreplicated blocks and trigger rereplication to get them
> back up to 3.
>
> If your nodes fail one by one with some time in between, the NN should
> have time to trigger rereplication between failures and the blocks will
> never be inaccessible.
>
> In general, simultaneous failures occur in two ways in the datacenter:
> one is that the entire datacenter loses power (or is forced to shut down
> due to lost cooling). In this case, no amount of replication within the
> DC will help. The other is that power (or network) is lost to an entire
> rack, either due to a switch failure or a failed PDU. If you've
> configured Hadoop's rack-awareness, it will ensure that each block is
> replicated on at least two racks to mitigate the downside of a rack loss.
>
> Depending on your particular setup, it may be worth spreading your
> 10-node cluster across separate power circuits and configuring them as
> separate "racks" in Hadoop, if you're concerned about flaky rack PDUs.
>
> Hope that helps.
> -Todd
>
> > Thanks,
> > - Bhushan
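For reference, the unique ID Todd mentions lives in the VERSION file under
each configured data directory. A rough sketch of what it typically looks
like on a 0.20-era cluster (the path and the values below are illustrative,
and the exact fields vary by Hadoop release):

    $ cat /data/1/dfs/data/current/VERSION
    #Thu Jul 30 00:12:34 PDT 2009
    namespaceID=1005618062
    storageID=DS-1265194435-10.0.0.17-50010-1248910354331
    cTime=0
    storageType=DATA_NODE
    layoutVersion=-18

The storageID (not the current IP address) is how the namenode recognizes a
returning datanode, and namespaceID only has to match the namenode's; that is
why there is no need to reformat the namenode, and why reformatting it would
in fact orphan the existing block data.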
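To watch the re-replication Todd describes after losing nodes, the stock
command-line tools are enough. Two commands that should work on a
0.18/0.20-era cluster (exact output format varies by version):

    # Live vs. dead datanodes, as seen by the namenode
    bin/hadoop dfsadmin -report

    # Block-level health report; the summary lists under-replicated and
    # missing blocks, which the namenode re-replicates over time
    bin/hadoop fsck / -files -blocks -locations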

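Finally, a sketch of the rack-awareness setup Todd suggests for treating
separate power circuits as separate "racks". On 0.20-era Hadoop the hook is
the topology.script.file.name property (the property name and config file
differ on other releases); the script path and the circuit-to-rack mapping
below are made-up examples:

    <!-- core-site.xml (hadoop-site.xml on older releases) -->
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/topology.sh</value>
    </property>

    #!/bin/sh
    # /etc/hadoop/topology.sh - maps datanode addresses to rack names.
    # Hadoop passes one or more IPs/hostnames as arguments and expects
    # one rack path per argument on stdout.
    for node in "$@"; do
      case "$node" in
        10.0.1.*) echo "/circuit-a" ;;   # nodes on power circuit A
        10.0.2.*) echo "/circuit-b" ;;   # nodes on power circuit B
        *)        echo "/default-rack" ;;
      esac
    done

With this in place the namenode places replicas of each block on at least two
"racks", so losing one circuit (or one real rack PDU) does not make any block
inaccessible.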