As you may have heard, the latest release of HDP-1 has some HA monitoring logic in it.
Specifically:

1. A canary monitor for VMware -a new init.d daemon that monitors HDFS, the JT, or anything else with one or more of (PID, port, URL), and stops singing when any of them fails or a probe blocks for too long. There's some complex lifecycle work at startup to deal with (a) slow-booting services and (b) the notion of upstream dependencies -there's no point reporting a JT startup failure if the NN is offline, as that is what is blocking the JT from opening its ports. (There's a rough sketch of the probe idea in the P.S. below.)

2. A Linux HA Resource Agent that uses the same liveness probes as the canary monitor, but is invoked from a Linux HA bash script. This replaces the init.d script on an HA cluster, relying on LinuxHA to start and stop it.

3. RPM packaging for both of these.

Test-wise, along with the unit tests of all the various probes, there's:

1. A Groovy GUI, Hadoop Availability Monitor, that can show what is going on and attempt FS and JT operations. Useful for demos and for watching what is happening.

2. An MR job designed to trigger task failures when the NN is down, by executing FS operations in the map or reduce phases. This is needed to verify that the JT doesn't over-react when HDFS is down.

3. The beginnings of a library to trigger failures on different infrastructure, "apache chaos". To date it handles VirtualBox and human intervention (it brings up a dialog); the manual option is quite good for coordinating physical actions like pulling the power out.

4. Test cases that do things like SSH in to machines, kill processes and verify the FS comes back up.

There are some slides on this on SlideShare, but the animations are missing -I can email out the full PPTX if someone really wants to see it.
http://www.slideshare.net/steve_l/availability-and-integrity-in-hadoop-strata-eu-edition

I think the best home for this is not some hadoop/contrib package but Bigtop -it fits in with the notion of Bigtop being where the daemon scripts live and where the RPMs come from. It also fits in with the Groovy test architecture -being able to hand closures down to trigger different system failures turns out to be invaluable (the P.P.S. below sketches that pattern).

I think the chaos stuff, if it were expanded to work with other virtual infrastructures (by way of jclouds) and more physical things (executing fencing device scripts, SSHing in to Linksys routers to cause network partitions), could be useful for anyone trying to create system failures during test runs. It's slightly different from the Netflix Chaos Monkey in that the monkey kills production servers on a whim; this could do the same if a process were set up to use the scripts -but for testing I want choreographed outages, and more aggressive failure simulation during large-scale test runs of the layers above. I also want to simulate more failures than just "VM goes away", as it's those more complex failures that show up in the physical world (network partition, process hang, process death).

If people think this looks like a good match, I'll go into more detail on what there is and what else could be done.

One thing I'd like to do is add a new reporter to the canary monitor daemon that handles service failures by killing and restarting the process. This could be deployed on all worker nodes to keep an eye on the TT, DN & region server, for better automated handling of things like the HTTPD in the TT blocking all callers, as Jetty is known to do from time to time.

-Steve
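
P.S. For anyone who wants a feel for the probe model before looking at the code, here's a rough Groovy sketch of the idea. The class names, ports and wiring are illustrative only, not the actual monitor's API -the real daemon adds the startup lifecycle and dependency handling described above.

    // a probe answers one question: does the service look live right now?
    abstract class Probe {
      String name
      abstract boolean ping()
    }

    // "is something listening on this port?"
    class PortProbe extends Probe {
      String host = "localhost"
      int port
      int timeoutMillis = 5000

      boolean ping() {
        def socket = new Socket()
        try {
          socket.connect(new InetSocketAddress(host, port), timeoutMillis)
          return true
        } catch (IOException ignored) {
          return false
        } finally {
          socket.close()
        }
      }
    }

    // "does this URL answer within the timeout?"
    class UrlProbe extends Probe {
      String url
      int timeoutMillis = 5000

      boolean ping() {
        try {
          def conn = new URL(url).openConnection()
          conn.connectTimeout = timeoutMillis
          conn.readTimeout = timeoutMillis
          conn.inputStream.close()
          return true
        } catch (IOException ignored) {
          return false
        }
      }
    }

    // upstream dependencies: don't blame the JT if the NN is already down
    def nnProbe = new PortProbe(name: "NN IPC", port: 8020)
    def jtProbe = new UrlProbe(name: "JT web UI", url: "http://localhost:50030/")
    if (!nnProbe.ping()) {
      println "NameNode is down -suppressing JobTracker reporting"
    } else if (!jtProbe.ping()) {
      println "JobTracker is down"
    } else {
      println "all probes singing"
    }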
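
P.P.S. And a rough sketch of the "hand a closure down" chaos idea. Again, the names, hosts and ports are placeholders for illustration, not the real library's classes.

    // a chaos action is just a closure that takes a target host and does the damage

    // VirtualBox: hard power-off of the VM, simulating pulling the plug
    Closure killVm = { String host ->
      ["VBoxManage", "controlvm", host, "poweroff"].execute().waitFor()
    }

    // manual: ask a human to do something physical and wait for them
    Closure manualAction = { String host ->
      println "Please pull the power on ${host}, then press return"
      System.in.read()
    }

    // the test doesn't care how the failure is triggered, only that it happens
    def testFsComesBack(Closure chaos) {
      chaos("nn1.example.org")        // take out the NN host
      sleep(120 * 1000)               // give the HA layer time to bring it back
      // then re-run the liveness checks; here, just "is the NN IPC port open again?"
      def s = new Socket()
      s.connect(new InetSocketAddress("nn1.example.org", 8020), 5000)
      s.close()
    }

    testFsComesBack(killVm)                  // scripted outage
    // testFsComesBack(manualAction)         // choreographed physical outage

The same test can be driven by any closure you pass in, which is what makes the pattern useful for both VM-level failures and the messier physical ones.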