Re: Active-Active Performance

Steve Loughran Tue, 25 May 2010 03:04:38 -0700

Anthony Ikeda wrote:

Thanks Hemanth,


In regards to different locations of the HADOOP home this is low
priority more for testing not production. I was trying to install HADOOP
for testing over 2 machines with only a Windows XP machine running
Cygwin and a Mac running Darwin. Not a priority.


Things are much easier if
 -all your machines have the same OS and disk structure
 -you are running on Linux
 -you use some CM tool to automate setup/deploy and the pushing out of config files

Start now, with VMware or VirtualBox images, so you learn about management sooner rather than later.

In regards to my last question about operating in a detached fashion, we
are trying to factor in what happens when the link between both sites is
cut. Will both sites operate independently until the connection is
re-established? Is there any particular setup required to ensure we can
cover this scenario or is it an out-of-the-box feature?

HDFS and the MapReduce engine are designed to run in a single datacentre with high bandwidth, high reliability links; current releases assume the facility is secure and all users are trusted. The key SPOF, the Namenode, doesn't do failover, so when it goes down or the network partitions, all machines that cannot see the NN poll and spin until it comes back, which can take a while unless you have a secondary namenode to keep the persistent files up to date. The workers all assume that the hostname and IP address of the namenode don't change, and they never reread their config. You could use DNS to do failover, but you have to tune the JVMs to not cache IP addresses for very long.
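As a rough sketch of the JVM tuning I mean: the standard Java security properties networkaddress.cache.ttl and networkaddress.cache.negative.ttl control how long successful and failed name lookups are cached. The class name, method name and the 30s/5s values below are made up for illustration, and this is not something Hadoop does for you; it would have to run very early in the process, before the JVM does its first lookup, or be set in the JRE's java.security file instead.

```java
import java.security.Security;

// Hypothetical helper: shortens the JVM-wide DNS cache so that a changed
// DNS record for the namenode host is picked up reasonably quickly.
public class DnsCacheTuning {

    // Both values are in seconds. Call this before any hostname is resolved,
    // because the JVM reads the cache policy when it first needs it.
    public static void configure(int positiveTtlSeconds, int negativeTtlSeconds) {
        // How long a successful lookup is cached.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // How long a failed lookup is cached.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        // Illustrative values only; pick something that matches your DNS TTLs.
        configure(30, 5);
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

Even with a short TTL the workers still only see the new address when they next resolve the name, so this narrows the window rather than giving you real failover.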

To do cross site stuff you'd need a separate HDFS filesystem per site; synchronisation of data then becomes a task for the higher level apps. I don't know what HBase, Cassandra or other column DB tools do here.



-steve
