As you may have heard, the latest release of HDP-1 has some HA monitoring logic in it.
Specifically:

1. A canary monitor for VMware -a new init.d daemon that monitors HDFS, the JT, or anything else with one or more of (PID, port, URL), and stops singing when any of them fails or a probe blocks for too long. There's some complex lifecycle work at startup to deal with (a) slow-booting services and (b) the notion of upstream dependencies -there's no point reporting a JT startup failure if the NN is offline, as that is what is blocking the JT from opening its ports. (There's a rough sketch of the probe idea in the P.S. below.)

2. A Linux HA Resource Agent that uses the same liveness probes as the canary monitor, but is invoked from a Linux HA bash script. This replaces the init.d script on an HA cluster, relying on LinuxHA to start and stop it.

3. RPM packaging for both of these.

Test-wise, along with the unit tests of all the various probes, there's:

1. A Groovy GUI, Hadoop Availability Monitor, that can show what is going on and attempt FS and JT operations. Useful for demos and for watching what is happening.

2. An MR job designed to trigger task failures when the NN is down, by executing FS operations in the map or reduce phases. This is needed to verify that the JT doesn't over-react when HDFS is down.

3. The beginnings of a library to trigger failures on different infrastructure, "apache chaos". To date it handles VirtualBox and human intervention (it brings up a dialog); the manual option is quite good for coordinating physical actions like pulling the power out.

4. Test cases that do things like SSH in to machines, kill processes and verify the FS comes back up.

There are some slides on this on SlideShare, but the animations are missing -I can email out the full PPTX if someone really wants to see it.
http://www.slideshare.net/steve_l/availability-and-integrity-in-hadoop-strata-eu-edition

I think the best home for this is not some hadoop/contrib package but Bigtop -it fits in with the notion of Bigtop being where the daemon scripts live and where the RPMs come from. It also fits in with the Groovy test architecture -being able to hand closures down to trigger different system failures turns out to be invaluable (the P.P.S. below sketches that pattern).

I think the chaos stuff, if it were expanded to work with other virtual infrastructures (by way of jclouds) and more physical things (executing fencing device scripts, SSHing in to Linksys routers to cause network partitions), could be useful for anyone trying to create system failures during test runs. It's slightly different from the Netflix Chaos Monkey in that the monkey kills production servers on a whim; this could do the same if a process were set up to use the scripts -but for testing I want choreographed outages, and more aggressive failure simulation during large-scale test runs of the layers above. I also want to simulate more failures than just "VM goes away", as it's those more complex failures that show up in the physical world (network partition, process hang, process death).

If people think this looks like a good match, I'll go into more detail on what there is and what else could be done.

One thing I'd like to do is add a new reporter to the canary monitor daemon that handles service failures by killing and restarting the process. This could be deployed on all worker nodes to keep an eye on the TT, DN & region server, for better automated handling of things like the HTTPD in the TT blocking all callers, as Jetty is known to do from time to time.

-Steve
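
P.S. For anyone who wants a feel for the probe model before looking at the code, here's a rough Groovy sketch of the idea. The class names, ports and wiring are illustrative only, not the actual monitor's API -the real daemon adds the startup lifecycle and dependency handling described above.

    // a probe answers one question: does the service look live right now?
    abstract class Probe {
      String name
      abstract boolean ping()
    }

    // "is something listening on this port?"
    class PortProbe extends Probe {
      String host = "localhost"
      int port
      int timeoutMillis = 5000

      boolean ping() {
        def socket = new Socket()
        try {
          socket.connect(new InetSocketAddress(host, port), timeoutMillis)
          return true
        } catch (IOException ignored) {
          return false
        } finally {
          socket.close()
        }
      }
    }

    // "does this URL answer within the timeout?"
    class UrlProbe extends Probe {
      String url
      int timeoutMillis = 5000

      boolean ping() {
        try {
          def conn = new URL(url).openConnection()
          conn.connectTimeout = timeoutMillis
          conn.readTimeout = timeoutMillis
          conn.inputStream.close()
          return true
        } catch (IOException ignored) {
          return false
        }
      }
    }

    // upstream dependencies: don't blame the JT if the NN is already down
    def nnProbe = new PortProbe(name: "NN IPC", port: 8020)
    def jtProbe = new UrlProbe(name: "JT web UI", url: "http://localhost:50030/")
    if (!nnProbe.ping()) {
      println "NameNode is down -suppressing JobTracker reporting"
    } else if (!jtProbe.ping()) {
      println "JobTracker is down"
    } else {
      println "all probes singing"
    }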
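
P.P.S. And a rough sketch of the "hand a closure down" chaos idea. Again, the names, hosts and ports are placeholders for illustration, not the real library's classes.

    // a chaos action is just a closure that takes a target host and does the damage

    // VirtualBox: hard power-off of the VM, simulating pulling the plug
    Closure killVm = { String host ->
      ["VBoxManage", "controlvm", host, "poweroff"].execute().waitFor()
    }

    // manual: ask a human to do something physical and wait for them
    Closure manualAction = { String host ->
      println "Please pull the power on ${host}, then press return"
      System.in.read()
    }

    // the test doesn't care how the failure is triggered, only that it happens
    def testFsComesBack(Closure chaos) {
      chaos("nn1.example.org")        // take out the NN host
      sleep(120 * 1000)               // give the HA layer time to bring it back
      // then re-run the liveness checks; here, just "is the NN IPC port open again?"
      def s = new Socket()
      s.connect(new InetSocketAddress("nn1.example.org", 8020), 5000)
      s.close()
    }

    testFsComesBack(killVm)                  // scripted outage
    // testFsComesBack(manualAction)         // choreographed physical outage

The same test can be driven by any closure you pass in, which is what makes the pattern useful for both VM-level failures and the messier physical ones.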