On Thu, Jul 30, 2009 at 12:38 AM, bhushan_mahale <
[email protected]> wrote:

>
> Thanks Todd for the reply.
>
> A few more queries:
> If the dead node comes up (with a different IP address and RSA key than
> before), can we use the data it has?
>

Yes - each datanode's storage directory holds a unique storage ID in
dfs.data.dir/current/VERSION, so changing its IP address should be fine.
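
For reference, that VERSION file is a small properties file that looks
roughly like this (the values below are made up, and the exact fields vary
between Hadoop versions):

  #Thu Jul 30 00:00:00 PDT 2009
  namespaceID=1632417987
  storageID=DS-1632417987-192.168.1.5-50010-1248931200000
  cTime=0
  storageType=DATA_NODE
  layoutVersion=-18

The storageID is what identifies the node, not its IP. The namespaceID has
to match the one the namenode was formatted with, which is one more reason
not to reformat the namenode just to bring old datanode data back.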


> If we just get the password-less login to work, would the data on that
> node be usable without having to reformat the namenode?
>

Not sure what exactly you mean by this. There's no need to reformat the
namenode if datanodes die.

As for passwordless login, that is a convenience used *only* by the
start-*.sh scripts. Hadoop itself does not rely on SSH in any way.
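
If you don't want to set up passwordless SSH at all, you can skip the
start-*.sh wrappers and start the daemons by hand on each machine, roughly
like this (run from the Hadoop install directory):

  # on the namenode machine
  bin/hadoop-daemon.sh start namenode

  # on each datanode machine
  bin/hadoop-daemon.sh start datanode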

-Todd

-----Original Message-----
> From: Todd Lipcon [mailto:[email protected]]
> Sent: Wednesday, July 29, 2009 11:58 PM
> To: [email protected]
> Subject: Re: To retrieve data on dead node
>
> On Wed, Jul 29, 2009 at 8:51 AM, bhushan_mahale <
> [email protected]> wrote:
>
> > Hi,
> >
> > What are the possible ways to retrieve the data if a node goes down in a
> > Hadoop cluster?
> >
> > Assuming a replication factor of 3, if 3 nodes go down in a 10-node
> > cluster, how do we retrieve the data?
> >
>
> Hi Bhushan,
>
> If 3 nodes go down at the same time, any block whose three replicas all
> lived on those nodes becomes inaccessible. If you cannot recover at least
> one of those nodes, you will have no way to recover that data. If you can
> recover at least one, the blocks will become available at replication
> count 1. The NN will notice the under-replicated blocks and trigger
> re-replication to get them back up to 3.
>
> If your nodes fail one by one with some time in between, the NN should
> have time to re-replicate between failures, and the blocks will never
> become inaccessible.
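>
> You can watch this from the outside with fsck, for example (the path here
> is just an example):
>
>   bin/hadoop fsck / -files -blocks -locations
>
> The summary at the end lists under-replicated, missing, and corrupt
> blocks, so you can tell whether re-replication has caught up.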
>
> In general, simultaneous failures occur in two ways in the datacenter: one
> is that the entire datacenter loses power (or is forced to shut down due
> to lost cooling). In that case, no amount of replication within the DC
> will help. The other is that power (or network) is lost to an entire rack,
> either due to a switch failure or a failed PDU. If you've configured
> Hadoop's rack-awareness, it will ensure that each block is replicated on
> at least two racks to mitigate the downside of losing a rack.
>
> Depending on your particular setup, it may be worth spreading your 10-node
> cluster across separate power circuits and configuring them as separate
> "racks" in Hadoop, if you're concerned about flaky rack PDUs.
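>
> Wiring that up just means pointing Hadoop at a script that maps each node
> address to a rack name. Something along these lines would do it in your
> site configuration (the script path, IP ranges, and rack labels below are
> all made up):
>
>   <property>
>     <name>topology.script.file.name</name>
>     <value>/path/to/rack-map.sh</value>
>   </property>
>
>   #!/bin/sh
>   # rack-map.sh: Hadoop passes host names/IPs as arguments and expects
>   # one rack path per line in return
>   for host in "$@"; do
>     case "$host" in
>       192.168.1.*) echo /circuit-a ;;
>       *)           echo /circuit-b ;;
>     esac
>   done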
>
> Hope that helps
> -Todd
>
>
> > Thanks,
> > - Bhushan
> >
> >
