Yeah, what he said.
It's never a good idea.
Forget about losing a NN or a rack; just consider losing connectivity between
the data centers. (It happens more often than you think.)
Your entire cluster, in both data centers, goes down. Boom!

It's a bad design.

You're better off doing two different clusters.

Is anyone really trying to sell this as a design? That's even scarier.
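
To put a rough number on the replication storm in point 2 of the quoted message, here is a back-of-the-envelope sketch. It is not how the NN actually schedules work; it just assumes each replica independently lands on an unreachable DN with some probability, and the function name, block count, and block size are all made up for illustration.

```python
def rereplication_bytes(total_blocks, block_size_bytes, replication, unreachable_fraction):
    """Expected bytes to re-copy after a fraction of DNs becomes unreachable.

    Assumes (hypothetically) that each replica independently sits on an
    unreachable DN with probability `unreachable_fraction`.
    """
    r, f = replication, unreachable_fraction
    # Expected replicas lost per block is r * f. Blocks that lost all r
    # replicas (probability f**r) have no surviving source to copy from,
    # so their r replicas are subtracted.
    replicas_to_copy_per_block = r * f - r * f ** r
    return total_blocks * block_size_bytes * replicas_to_copy_per_block


if __name__ == "__main__":
    # Hypothetical cluster: 1,000,000 blocks of 64 MiB, replication 3,
    # half of the DNs on the far side of the broken link.
    total = rereplication_bytes(1_000_000, 64 * 2**20, 3, 0.5)
    print(f"{total / 2**40:.1f} TiB to re-replicate")
```

Under those assumptions that works out to roughly 69 TiB that the reachable half would try to copy among itself, and presumably much of it becomes excess replicas once the link comes back.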


> Subject: Re: Hadoop cluster network requirement
> From: [email protected]
> Date: Sun, 31 Jul 2011 20:28:53 -0700
> To: [email protected]; [email protected]
> 
> 
> On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:
> 
> > Thanks, I'm independently doing some digging into Hadoop networking
> > requirements and 
> > had a couple of quick follow-ups. Could I have some specific info on why
> > different data centers 
> > cannot be supported for master node and data node comms?
> > Also, what 
> > may be the benefits/use cases for such a scenario?
> 
>       Most people who try to put the NN and DNs in different data centers are 
> trying to achieve disaster recovery:  one file system in multiple locations.  
> That isn't the way HDFS is designed and it will end in tears. There are 
> multiple problems:
> 
> 1) no guarantee that one block replica will be in each data center (thereby 
> defeating the whole purpose!)
> 2) assuming one can work out problem 1, during a network break, the NN will 
> lose contact with one half of the DNs, causing a massive network replication 
> storm
> 3) if one is using MR on top of this HDFS, the shuffle will likely kill the 
> network in between (making MR performance pretty dreadful) and is going to 
> cause delays for the DN heartbeats
> 4) I don't even want to think about rebalancing.
> 
>       ... and I'm sure a lot of other problems I'm forgetting at the moment.  
> So don't do it.
> 
>       If you want disaster recovery, set up two completely separate HDFSes 
> and run everything in parallel.
                                          
