Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/LargeClusterTips

The comment on the change is:
more ideas, including things to test before you go live

------------------------------------------------------------------------------
  
   * Have a good sysadmin if you're not one yourself.
   * Take a look at a presentation done by Allen Wittenauer from Yahoo!: 
http://tinyurl.com/5foamm
-  * Have the LAN closed off to untrusted users. This simplifies security.
+  * Have the LAN closed off to untrusted users. Without this, your filesystem 
is effectively open to everyone on the network.
+  * Once you are on the private LAN, turn off the firewalls on the individual 
machines, as they only create connectivity problems.
   * Use LDAP or similar to manage user accounts.
   * Only put the slaves file on your namenode and secondary namenode to 
prevent confusion.
+ 
-  * Have identical hardware on all machines in the cluster, eliminating the 
need to have different
-    configuration options (task slots, data directory locations, etc)
   * Use RPMs to install the Hadoop binaries. Self:Cloudera provide some RPMs 
for this, and a web site to generate configuration RPM files.
   * Use kickstart or similar to bring up the machines. 
-  * Consider a system configuration management package to keep Hadoop's source 
and configuration consistent across all nodes.  Some example packages are 
bcfg2, smartfrog, puppet, cfengine, etc. 
   * If you are trying to configure the machines one by one, step away from the 
keyboard. That is not the way to manage a cluster.
+  * Keep an eye out for disk SMART messages in the server logs. They warn of 
trouble.
+  * Keep an eye on disk capacity, especially on the namenode. You do not want 
the NN to run out of space for its image and edit log, as Bad Things happen 
(a minimal watchdog sketch follows this list).
+  * Keep the underlying software in sync across the cluster: OS and Java versions. 
+  * Run the rebalancer regularly, throttled back appropriately for your network 
bandwidth.
+ 
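+ A minimal watchdog sketch for the disk-capacity point above, in Python. The 
directory list and the 10% threshold are assumptions to adapt for your site: 
point it at wherever the NN keeps its image and edit log, run it from cron, and 
wire the warning into whatever alerting you already have.
+ {{{
+ #!/usr/bin/env python
+ # Free-space watchdog for the namenode storage directories.
+ # The paths and the threshold are examples -- adjust them for your site.
+ import shutil
+ import sys
+ 
+ NN_DIRS = ["/data1/dfs/name", "/remote/nfs/dfs/name"]   # example paths
+ MIN_FREE_FRACTION = 0.10   # warn when less than 10% of the disk is free
+ 
+ def main():
+     ok = True
+     for d in NN_DIRS:
+         usage = shutil.disk_usage(d)
+         free_fraction = float(usage.free) / usage.total
+         if free_fraction < MIN_FREE_FRACTION:
+             print("WARNING: %s has only %.1f%% free" % (d, 100 * free_fraction))
+             ok = False
+     sys.exit(0 if ok else 1)
+ 
+ if __name__ == "__main__":
+     main()
+ }}}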
  
  See the Self:AmazonEC2 and AmazonS3 pages for tips on managing clusters built 
on EC2 and S3.
  
  Other good documentation: 
[http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment 
Patterns of Hadoop Deployment]
  
+ == Hadoop Configuration ==
+ 
+  * Don't do it by copying XML files around by hand.
+  * Look at the Cloudera configuration tools. If you use them, keep the 
previous RPMs around.
+  * Consider a system configuration management package to keep Hadoop's source 
and configuration consistent across all nodes.  Some example packages are 
bcfg2, SmartFrog, Puppet, cfengine, etc. 
+  * Keep your site XML files under SCM, so you can roll back and diff changes 
(see the sketch below). 
+ 
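+ To make the diffs easy, a small sketch like this can dump a site file as sorted 
name=value lines before you check a change in; the output is independent of XML 
formatting, so diffs show only real property changes. It takes the file name as 
an argument, so it works for whichever site file you are editing.
+ {{{
+ #!/usr/bin/env python
+ # Dump a Hadoop site configuration file as sorted "name=value" lines so
+ # that diffs between revisions ignore XML formatting.
+ # Usage: dump_conf.py <path-to-site-xml>
+ import sys
+ import xml.etree.ElementTree as ET
+ 
+ def dump(conf_file):
+     root = ET.parse(conf_file).getroot()
+     props = {}
+     for prop in root.findall("property"):
+         name = prop.findtext("name", "").strip()
+         props[name] = prop.findtext("value", "").strip()
+     for name in sorted(props):
+         print("%s=%s" % (name, props[name]))
+ 
+ if __name__ == "__main__":
+     dump(sys.argv[1])
+ }}}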
+ 
+ == NameNode Health ==
+ 
+ The NameNode is a SPOF. When it goes offline, the cluster goes down. If it 
loses its data, the filesystem is gone. Value it.
+  * Have a secondary name node! When the BackupNode replaces this, have a 
BackupNode!
+  * Never let its disks fill up.
+  * Consider RAID storage here. If not, set the NN to save its data to two 
independent drives, ideally on separate controllers (just in case a controller 
decides to play up).
+  * Set the NN up to save one copy of all its data to a remote machine (e.g. 
over NFS), so even if the NN host dies you can bring up a new machine with the 
same hostname for everything else to bind to (a sanity-check sketch for these 
storage directories follows below).
+ 
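+ A sanity-check sketch for those storage directories: it takes the same list of 
directories you would put in dfs.name.dir and complains if any of them is 
missing, not writable, or sitting on the same device as another. The paths are 
examples only.
+ {{{
+ #!/usr/bin/env python
+ # Sanity-check the namenode storage directories: each one should exist,
+ # be writable, and live on its own device. Paths are examples.
+ import os
+ import sys
+ 
+ NAME_DIRS = ["/data1/dfs/name", "/data2/dfs/name", "/remote/nfs/dfs/name"]
+ 
+ def main():
+     devices = {}
+     ok = True
+     for d in NAME_DIRS:
+         if not os.path.isdir(d) or not os.access(d, os.W_OK):
+             print("WARNING: %s is missing or not writable" % d)
+             ok = False
+             continue
+         dev = os.stat(d).st_dev
+         if dev in devices:
+             print("WARNING: %s shares a device with %s" % (d, devices[dev]))
+             ok = False
+         else:
+             devices[dev] = d
+     sys.exit(0 if ok else 1)
+ 
+ if __name__ == "__main__":
+     main()
+ }}}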
+ 
+ == Workers ==
+ 
+  * Have identical hardware on all workers in the cluster, eliminating the 
need to have different configuration options (task slots, data directory 
locations, etc.)
+  * Have a common user account on every machine you run Hadoop on, with a 
common public key in ~/.ssh/authorized_keys
+  * Track HDDs, their history and their failures. Disk failures are not always 
independent in a large datacentre.
+  * Have simple hostname-to-rack or IP-to-rack mappings, so the rack detection 
scripts are trivial (a sample topology script follows this list).
+ 
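+ A sample topology script in Python for the rack-mapping point above. It assumes 
the third octet of a worker's IP address identifies its rack, which is only an 
example convention; Hadoop passes one or more IP addresses (or hostnames) on the 
command line and expects a rack path back for each one. This is the script that 
topology.script.file.name points at.
+ {{{
+ #!/usr/bin/env python
+ # Example topology script: map each argument (an IP address or hostname)
+ # to a rack path. Assumes the third octet of the IP identifies the rack,
+ # e.g. 10.1.17.42 -> /rack17; anything else falls back to /default-rack.
+ import sys
+ 
+ def rack_of(host):
+     parts = host.split(".")
+     if len(parts) == 4 and all(p.isdigit() for p in parts):
+         return "/rack%s" % parts[2]
+     return "/default-rack"
+ 
+ if __name__ == "__main__":
+     print(" ".join(rack_of(arg) for arg in sys.argv[1:]))
+ }}}
+ 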
+ === How to rebalance a full datanode ===
+ 
+ If a datanode is at or near 100% capacity: 
+  1. Decommission the node: this copies all of its blocks onto other datanodes 
(see the sketch after this list). 
+  2. Take it offline.
+  3. Delete the data, clean up the HDDs.
+  4. Add the node again. 
+ 
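+ Step 1 can be scripted. The sketch below assumes the excludes file is the one 
named by dfs.hosts.exclude in your configuration and that the hadoop command is 
on the PATH; it appends the node's hostname and asks the namenode to re-read the 
file. Watch the web UI or "hadoop dfsadmin -report" until the node is reported 
as decommissioned before taking it offline.
+ {{{
+ #!/usr/bin/env python
+ # Start decommissioning a datanode: add it to the excludes file, then tell
+ # the namenode to re-read that file. The excludes path is an example; it
+ # must match whatever dfs.hosts.exclude points at.
+ import subprocess
+ import sys
+ 
+ EXCLUDES_FILE = "/etc/hadoop/conf/dfs.exclude"   # example path
+ 
+ def decommission(hostname):
+     with open(EXCLUDES_FILE, "a") as f:
+         f.write(hostname + "\n")
+     # Ask the namenode to pick up the new excludes list.
+     subprocess.check_call(["hadoop", "dfsadmin", "-refreshNodes"])
+ 
+ if __name__ == "__main__":
+     decommission(sys.argv[1])
+ }}}
+ 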
+ == Testing Failure ==
+ 
+ Things will go wrong. There is always a SPOF somewhere. Test your 
failure-handling processes before you go live. 
+ 
+ * Simulate a corrupted edit log by killing the namenode process, truncating 
the (binary) edit log, and bringing the namenode back up. See how the team 
handles it (a sketch for the truncation step follows this list). 
+ * Turn off one of the switches, pull out a network cable. See how the cluster 
handles it, how it recovers. Then put the switch back on.
+ * Turn an entire rack off without warning. See what happens when its nodes go 
offline.
+ * Turn off DNS. 
+ * Turn off the entire datacenter, switch it back on. Are there any race 
conditions?
+ * Write a job that tries to generate too much data and fill up the cluster. 
How is it handled?
+ 
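+ For the corrupted-edit-log drill in the first item, a sketch along these lines 
chops a few bytes off the end of the edits file once the namenode process has 
been killed. Only ever run it on a test cluster; the path is an example and 
depends on where dfs.name.dir points.
+ {{{
+ #!/usr/bin/env python
+ # Fire-drill helper: truncate the tail of the namenode edit log to simulate
+ # corruption. TEST CLUSTERS ONLY. The path is an example; the file lives
+ # under whatever directory dfs.name.dir points at.
+ import os
+ 
+ EDITS_FILE = "/data1/dfs/name/current/edits"   # example path
+ BYTES_TO_CHOP = 64
+ 
+ def corrupt_edits(path):
+     size = os.path.getsize(path)
+     with open(path, "r+b") as f:
+         f.truncate(max(0, size - BYTES_TO_CHOP))
+ 
+ if __name__ == "__main__":
+     corrupt_edits(EDITS_FILE)
+ }}}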
