RE: Hadoop cluster network requirement

2011-08-01 Thread Michael Segel

Yeah, what he said.
It's never a good idea.
Forget about losing a NN or a rack; just consider losing connectivity between data 
centers. (It happens more often than you think.)
Your entire cluster in both data centers goes down. Boom!

It's a bad design. 

You're better off running two different clusters.

Is anyone really trying to sell this as a design? That's even scarier.


 Subject: Re: Hadoop cluster network requirement
 From: a...@apache.org
 Date: Sun, 31 Jul 2011 20:28:53 -0700
 To: common-user@hadoop.apache.org; saq...@margallacomm.com
 
 
 On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:
 
  Thanks, I'm independently doing some digging into Hadoop networking
  requirements and 
  had a couple of quick follow-ups. Could I have some specific info on why
  different data centers 
  cannot be supported for master node and data node comms?
  Also, what 
  may be the benefits/use cases for such a scenario?
 
   Most people who try to put the NN and DNs in different data centers are 
 trying to achieve disaster recovery:  one file system in multiple locations.  
 That isn't the way HDFS is designed and it will end in tears. There are 
 multiple problems:
 
 1) no guarantee that one block replica will be in each data center (thereby 
 defeating the whole purpose!)
 2) assuming one can work out problem 1, during a network break, the NN will 
 lose contact with one half of the DNs, causing a massive network replication 
 storm
 3) if one is using MR on top of this HDFS, the shuffle will likely kill the 
 network in between (making MR performance pretty dreadful) and cause 
 delays for the DN heartbeats
 4) I don't even want to think about rebalancing.
 
   ... and I'm sure a lot of other problems I'm forgetting at the moment.  
 So don't do it.
 
   If you want disaster recovery, set up two completely separate HDFSes 
 and run everything in parallel.
  

Re: Hadoop cluster network requirement

2011-08-01 Thread Mohit Anchlia
Assuming everything is up, this solution still will not scale given the latency, 
TCP/IP buffers, sliding windows, etc. See BDP (bandwidth-delay product).
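The BDP point can be made concrete with a back-of-the-envelope calculation; the link speed and round-trip time below are illustrative assumptions, not measurements of any real inter-data-center link:

```python
# Back-of-the-envelope bandwidth-delay product (BDP) estimate for an
# inter-data-center link. Link speed and RTT are assumed values.
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_seconds / 8

# Example: 1 Gbit/s link with a 40 ms round trip between data centers.
# Filling it needs ~5 MB in flight -- far beyond default TCP window sizes
# of the era, so per-connection throughput collapses.
bdp = bdp_bytes(1_000_000_000, 0.040)
print(f"BDP: {bdp / 1024:.0f} KiB")
```

With many DN-to-DN and shuffle connections each limited this way, aggregate cross-site throughput falls well short of what the same hardware delivers inside one data center.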

Sent from my iPad

On Aug 1, 2011, at 4:57 PM, Michael Segel michael_se...@hotmail.com wrote:

 
 Yeah, what he said.
 It's never a good idea.
 Forget about losing a NN or a rack; just consider losing connectivity between data 
 centers. (It happens more often than you think.)
 Your entire cluster in both data centers goes down. Boom!
 
 It's a bad design. 
 
 You're better off running two different clusters.
 
 Is anyone really trying to sell this as a design? That's even scarier.
 
 
 Subject: Re: Hadoop cluster network requirement
 From: a...@apache.org
 Date: Sun, 31 Jul 2011 20:28:53 -0700
 To: common-user@hadoop.apache.org; saq...@margallacomm.com
 
 
 On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:
 
 Thanks, I'm independently doing some digging into Hadoop networking
 requirements and 
 had a couple of quick follow-ups. Could I have some specific info on why
 different data centers 
 cannot be supported for master node and data node comms?
 Also, what 
 may be the benefits/use cases for such a scenario?
 
Most people who try to put the NN and DNs in different data centers are 
 trying to achieve disaster recovery:  one file system in multiple locations. 
  That isn't the way HDFS is designed and it will end in tears. There are 
 multiple problems:
 
 1) no guarantee that one block replica will be in each data center (thereby 
 defeating the whole purpose!)
 2) assuming one can work out problem 1, during a network break, the NN will 
 lose contact with one half of the DNs, causing a massive network 
 replication storm
 3) if one is using MR on top of this HDFS, the shuffle will likely kill the 
 network in between (making MR performance pretty dreadful) and cause 
 delays for the DN heartbeats
 4) I don't even want to think about rebalancing.
 
... and I'm sure a lot of other problems I'm forgetting at the moment.  
 So don't do it.
 
If you want disaster recovery, set up two completely separate HDFSes and 
 run everything in parallel.
 


Hadoop cluster network requirement

2011-07-31 Thread jonathan.hwang
I was asked by our IT folks if we can put Hadoop name node storage on a 
shared disk storage unit.  Does anyone have experience of how much IO 
throughput is required on the name nodes?  What are the latency/data throughput 
requirements between the master and data nodes - can this tolerate network 
routing?

Has anyone published throughput requirements or best-practice network setup 
recommendations?

Thanks!
Jonathan



This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise private information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the email by you is prohibited.


Re: Hadoop cluster network requirement

2011-07-31 Thread Allen Wittenauer

On Jul 31, 2011, at 12:08 PM, jonathan.hw...@accenture.com
 jonathan.hw...@accenture.com wrote:

 I was asked by our IT folks if we can put Hadoop name node storage on a 
 shared disk storage unit.  

What do you mean by shared disk storage unit?  There are lots of 
products out there that would claim this, so actual deployment semantics are 
important.

 Does anyone have experience of how much IO throughput is required on the name 
 nodes?

IO throughput is completely dependent upon how many changes 
are being applied to the file system and the frequency of edits log merging.  In 
the majority of cases it is not much.  What tends to happen where the storage 
is shared (such as a NAS) is that the *other* traffic blocks the writes for too 
long because the unit is overloaded, and the NN declares the storage dead.
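To put rough numbers on "not much", here is an illustrative estimate of edits-log write volume; the operations rate and per-record size are assumptions for the sake of the arithmetic, not measured Hadoop figures:

```python
# Rough estimate of NameNode edits-log write throughput. The metadata
# ops/sec rate and average edit-record size are illustrative assumptions.
def edits_log_throughput(ops_per_sec, avg_record_bytes=200):
    """Approximate sustained bytes/sec appended to the edits log."""
    return ops_per_sec * avg_record_bytes

# Even a busy cluster doing 1,000 metadata ops/sec would append only
# on the order of a couple hundred KB/s -- trivial for dedicated disks.
rate = edits_log_throughput(1_000)
print(f"~{rate / 1_000:.0f} KB/s")
```

The failure mode Allen describes is therefore not bandwidth but latency: a shared unit saturated by other tenants can stall even these tiny writes long enough to trip the NN's failure handling.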

  What are the latency/data throughput requirements between the master and 
 data nodes - can this tolerate network routing?

If you mean different data centers, then no.  If you mean same data 
center, but with routers in between, then probably yes, but you add several 
more failure points, so this isn't recommended. 

 Has anyone published throughput requirements or best-practice network setup 
 recommendations?

Not that I know of.  It is very much dependent upon the actual workload 
being performed.  But I wouldn't deploy anything slower than a 1:4 overcommit 
(uplink-to-host) on the DN side for anything real/significant.
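The 1:4 overcommit figure can be sanity-checked with simple arithmetic; the port counts and link speeds below are invented for illustration:

```python
# Worked example of uplink-to-host oversubscription on a top-of-rack
# switch. Port counts and speeds are made-up illustrative values.
def oversubscription(host_ports, host_gbps, uplink_ports, uplink_gbps):
    """Ratio of host-facing bandwidth to uplink bandwidth (N:1)."""
    return (host_ports * host_gbps) / (uplink_ports * uplink_gbps)

# 40 DataNodes at 1 GbE behind a single 10 GbE uplink gives 4:1 --
# i.e. the 1:4 uplink-to-host overcommit mentioned above.
ratio = oversubscription(40, 1, 1, 10)
print(f"{ratio:.0f}:1")
```

Anything more oversubscribed than this means the uplink, not the hosts, becomes the bottleneck whenever a shuffle or re-replication crosses racks.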

 This message is for the designated recipient only and may contain privileged, 
 proprietary, or otherwise private information. If you have received it in 
 error, please notify the sender immediately and delete the original. Any 
 other use of the email by you is prohibited.

Lawyers are funny people.  I wonder how much they got paid for this one.

RE: Hadoop cluster network requirement

2011-07-31 Thread Saqib Jang -- Margalla Communications
Thanks, I'm independently doing some digging into Hadoop networking
requirements and 
had a couple of quick follow-ups. Could I have some specific info on why
different data centers 
cannot be supported for master node and data node comms? Also, what 
may be the benefits/use cases for such a scenario?

Saqib

-Original Message-
From: jonathan.hw...@accenture.com [mailto:jonathan.hw...@accenture.com] 
Sent: Sunday, July 31, 2011 12:09 PM
To: common-user@hadoop.apache.org
Subject: Hadoop cluster network requirement

I was asked by our IT folks if we can put Hadoop name node storage on a
shared disk storage unit.  Does anyone have experience of how much IO
throughput is required on the name nodes?  What are the latency/data
throughput requirements between the master and data nodes - can this
tolerate network routing?

Has anyone published throughput requirements or best-practice network setup
recommendations?

Thanks!
Jonathan






Re: Hadoop cluster network requirement

2011-07-31 Thread Allen Wittenauer

On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:

 Thanks, I'm independently doing some digging into Hadoop networking
 requirements and 
 had a couple of quick follow-ups. Could I have some specific info on why
 different data centers 
 cannot be supported for master node and data node comms?
 Also, what 
 may be the benefits/use cases for such a scenario?

Most people who try to put the NN and DNs in different data centers are 
trying to achieve disaster recovery:  one file system in multiple locations.  
That isn't the way HDFS is designed and it will end in tears. There are 
multiple problems:

1) no guarantee that one block replica will be in each data center (thereby 
defeating the whole purpose!)
2) assuming one can work out problem 1, during a network break, the NN will 
lose contact with one half of the DNs, causing a massive network replication 
storm
3) if one is using MR on top of this HDFS, the shuffle will likely kill the 
network in between (making MR performance pretty dreadful) and cause 
delays for the DN heartbeats
4) I don't even want to think about rebalancing.

... and I'm sure a lot of other problems I'm forgetting at the moment.  
So don't do it.
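Problem 1 comes from how HDFS learns topology: replica placement is driven by an operator-supplied topology script (configured via topology.script.file.name in Hadoop of this era) that maps each host to a rack path, and the default placement policy reasons only about racks, never about data centers. A minimal sketch of such a script, with invented hostnames and rack paths:

```python
#!/usr/bin/env python
# Toy Hadoop topology script: the NN invokes it with DataNode hosts/IPs as
# arguments and reads back one rack path per line. Hostnames and rack
# paths below are invented examples.
import sys

RACKS = {
    "dn1.example.com": "/dc1/rack1",
    "dn2.example.com": "/dc1/rack2",
    "dn3.example.com": "/dc2/rack1",  # lives in DC 2, but HDFS just sees "a rack"
}

def rack_for(host):
    """Map a host to its rack path; unknown hosts get the default rack."""
    return RACKS.get(host, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Even if the script encodes the data center in the path, the default block placement policy spreads replicas across racks, not across top-level path components, so nothing guarantees a copy in each site.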

If you want disaster recovery, set up two completely separate HDFSes 
and run everything in parallel.