Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/LargeClusterTips

The comment on the change is:
formatting

------------------------------------------------------------------------------
  Things will go wrong. There is always a SPOF. Test your failure handling processes before you go live.
- * Simulate a corrupted edit log by killing the namenode process, truncating the (binary) edit log, and bringing it up. See how the team handles it.
+ * Simulate a corrupted edit log by killing the namenode process, truncating the (binary) edit log, and bringing it up. See how the team handles it.
- * Turn off one of the switches, pull out a network cable. See how the cluster handles it, how it recovers. Then put the switch back on.
+ * Turn off one of the switches, pull out a network cable. See how the cluster handles it, how it recovers. Then put the switch back on.
- * Turn an entire rack off without warning. See what happens when its nodes go offline.
+ * Turn an entire rack off without warning. See what happens when its nodes go offline.
- * Turn off DNS.
+ * Turn off DNS. Or just the rDNS side of things.
- * Turn off the entire datacenter, switch it back on. Are there any race conditions?
+ * Turn off the entire datacenter, switch it back on. Are there any race conditions?
- * Write a job that tries to generate too much data and fills up the cluster. How is it handled?
+ * Write a job that tries to generate too much data and fills up the cluster. How is it handled?
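The edit-log truncation drill in the list above could be prepared with a small helper like the one below. This is only a sketch under stated assumptions: the function name `truncated_copy`, the `keep_fraction` parameter, and the sample file names are hypothetical and not part of Hadoop. It works on a copy of a file so the original stays intact, and it does not stop or start the namenode. Where the real edit log lives depends on your configuration, so no path is assumed here.

```python
import os
import shutil
import tempfile


def truncated_copy(src_path, dst_path, keep_fraction=0.5):
    """Copy src_path to dst_path, then cut the copy down to keep_fraction of its size.

    Hypothetical helper for a failure drill: the source file is never modified.
    Returns the size in bytes of the truncated copy.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    shutil.copyfile(src_path, dst_path)
    new_size = int(os.path.getsize(dst_path) * keep_fraction)
    # Open in binary read/write mode so truncate() shortens the existing bytes
    # instead of recreating the file.
    with open(dst_path, "r+b") as f:
        f.truncate(new_size)
    return new_size


if __name__ == "__main__":
    # Demonstration on a throwaway file of random bytes, not a real edit log.
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "edits.sample")
        damaged = os.path.join(tmp, "edits.truncated")
        with open(original, "wb") as f:
            f.write(os.urandom(1000))
        size = truncated_copy(original, damaged, keep_fraction=0.4)
        print("original bytes:", os.path.getsize(original))
        print("truncated bytes:", size)
```

Truncating a copy rather than the live file keeps an intact version to restore from once the drill is over; swapping the damaged copy into place on a stopped test namenode is left to the operator.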
