Anthony Ikeda wrote:
Thanks Hemanth,
Regarding different locations of the Hadoop home: this is low
priority, more for testing than production. I was trying to install
Hadoop for testing across 2 machines, with only a Windows XP machine
running Cygwin and a Mac running Darwin. Not a priority.
Things are much easier if
-all your machines have the same OS and disk layout
-you are running on Linux
-you use a configuration management (CM) tool to automate setup,
deployment and pushing out of config files
Start now with VMware or VirtualBox images, so you learn about
cluster management sooner rather than later.
In regards to my last question about operating in a detached fashion, we
are trying to factor in what happens when the link between both sites is
cut. Will both sites operate independently until the connection is
re-established? Is there any particular setup required to ensure we can
cover this scenario or is it an out-of-the-box feature?
HDFS and the MapReduce engine are designed to run in a single
datacentre with high-bandwidth, high-reliability links. Current
releases assume the facility is secure and all users are trusted. The
key SPOF, the Namenode, doesn't do failover, so when it goes down or
the network partitions, all machines that cannot see the NN poll and
spin until it comes back, which can take a while unless you have a
secondary namenode to keep the persistent files up to date. The workers
all assume that the hostname and IP address of the namenode don't
change, and never reread their config. You could use DNS to do
failover, but you have to tune the JVMs to not cache IP addresses for
very long.
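As a rough sketch of the JVM tuning I mean: the JDK has two security
properties, networkaddress.cache.ttl and
networkaddress.cache.negative.ttl, that control how long resolved (and
failed) hostname lookups are cached. The class name and the TTL values
below are made up for illustration, and this only helps if the process
actually looks the hostname up again; it is not something Hadoop does
for you.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

public class DnsCacheTuning {
    // Hypothetical TTLs in seconds, chosen only for illustration.
    static final String POSITIVE_TTL_SECONDS = "30";
    static final String NEGATIVE_TTL_SECONDS = "5";

    /** Call this before the JVM performs its first hostname lookup. */
    static void limitDnsCaching() {
        // How long a successful lookup may be served from the cache.
        Security.setProperty("networkaddress.cache.ttl", POSITIVE_TTL_SECONDS);
        // How long a failed lookup may be served from the cache.
        Security.setProperty("networkaddress.cache.negative.ttl", NEGATIVE_TTL_SECONDS);
    }

    public static void main(String[] args) throws UnknownHostException {
        limitDnsCaching();
        // Resolve the host given on the command line, or localhost by default.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
    }
}
```

The same two properties can also be set in the JRE's java.security
file, which avoids touching any code.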
To do cross-site work you'd need a separate HDFS filesystem per site;
synchronisation of data then becomes a task for the higher-level apps.
I don't know what HBase, Cassandra or other column DB tools do here.
-steve