Edward Capriolo wrote:
> I know there is a JIRA open to add lifecycle methods to each Hadoop
> component that can be polled for progress. I don't know the # off hand.
That would be HDFS-326 (https://issues.apache.org/jira/browse/HDFS-326);
the code has its own branch.
This is still something I'm working on. The code works and all the tests
pass, but there are some quirks I'm not happy with: the JobTracker now
blocks at startup waiting for the filesystem to come up, so I need to add
some new tests and mechanisms for shutting down a service while it is
still starting up, which includes being able to interrupt the JobTracker
and TaskTracker.
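A minimal sketch of the kind of mechanism that needs, assuming a
hypothetical `InterruptibleService` class (none of these names come from
the actual HDFS-326 branch): stopping a service that is still blocked in
startup means interrupting the thread doing the startup work.

```java
// Hypothetical sketch only: a service whose stop() can interrupt a
// startup that is blocked waiting on a dependency (e.g. the filesystem).
// Class and method names are illustrative, not from the HDFS-326 branch.
public class InterruptibleService {
    public enum State { CREATED, STARTING, LIVE, STOPPED }

    private volatile State state = State.CREATED;
    private volatile Thread startupThread;

    /** Runs in the caller's thread; may block for a long time. */
    public void start() {
        state = State.STARTING;
        startupThread = Thread.currentThread();
        try {
            waitForFilesystem();       // stand-in for the blocking work
            state = State.LIVE;
        } catch (InterruptedException e) {
            state = State.STOPPED;     // interrupted mid-startup: give up cleanly
        }
    }

    /** Safe to call at any point in the lifecycle, including mid-startup. */
    public void stop() {
        Thread t = startupThread;
        if (state == State.STARTING && t != null) {
            t.interrupt();             // unblock the stuck startup thread
        } else {
            state = State.STOPPED;
        }
    }

    public State getState() { return state; }

    private void waitForFilesystem() throws InterruptedException {
        Thread.sleep(60_000);          // pretend the filesystem is slow to come up
    }
}
```

The point of the sketch is that stop() has to be legal in every state,
not just LIVE, which is exactly what the current JobTracker startup path
makes awkward.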
You can get RPMs with all this stuff packaged up from
http://smartfrog.org/ , with the caveat that it's still fairly unstable.
I'm currently working on the other side of the equation, integration with
multiple cloud infrastructures, with all the fun testing issues that follow:
http://www.1060.org/blogxter/entry?publicid=12CE2B62F71239349F3E9903EAE9D1F0
* The simplest liveness test for any of the workers right now is to hit
their HTTP pages; it's the classic "happy page" test. We can and should
extend this with more self-tests, some equivalent of Axis's happy.jsp.
The nice thing about these is that they integrate well with all the
existing web-page monitoring tools, though I should warn that the same
tooling that tracks and reports the health of a four-way app server
doesn't really scale to keeping an eye on 3000 task trackers. It's not
the monitoring that breaks down, it's the reporting.
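As a sketch of that probe, this is all a "happy page" check amounts to;
the host, port, and page path below are assumptions to be filled in from
your cluster config, not values the patch defines:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal "happy page" liveness probe: a worker counts as live if its
// embedded web server answers 200 on a status page. Host/port/page are
// assumptions; use whatever your cluster config says.
public class HappyCheck {
    public static boolean isHappy(String host, int port, String page) {
        try {
            URL url = new URL("http", host, port, page);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(2000);   // don't hang the monitor on a dead box
            conn.setReadTimeout(2000);
            conn.setRequestMethod("GET");
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;                   // unreachable or timed out: not happy
        }
    }
}
```

Anything that can issue an HTTP GET and check the status code can run
this, which is why it plugs straight into existing web monitoring tools.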
* Detecting failures of TaskTrackers and DataNodes is tricky too; it's
really the NameNode and JobTracker that know best. We need to get some
reporting in there so that when either of the masters thinks one of its
workers is playing up, it reports that to whatever plugin wants to know.
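That reporting hook could be as simple as a callback interface that a
monitoring plugin registers with the master. The sketch below is purely
illustrative; none of these names exist in the codebase today:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative only: a listener a monitoring plugin could register with
// the NameNode/JobTracker to hear about workers the master suspects.
public class WorkerHealth {
    public interface Listener {
        void workerSuspect(String workerId, String reason);
    }

    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    public void addListener(Listener l) { listeners.add(l); }

    /** Called by the master when a heartbeat is missed, tasks keep failing, etc. */
    public void reportSuspect(String workerId, String reason) {
        for (Listener l : listeners) {
            l.workerSuspect(workerId, reason);
        }
    }
}
```

The master already has the heartbeat and task-failure data; the only new
piece is the fan-out to whoever registered an interest.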
* Handling failures of VMs is very different from handling failures of
physical machines: you just kill the VM and start a new one. We don't
need all the blacklisting machinery, just some infrastructure operations
and various notifications to the ops team.
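In code terms the VM policy is just "terminate and reprovision" instead
of "blacklist". The `CloudProvider` interface here is hypothetical, a
stand-in for whichever infrastructure API is actually in use:

```java
// Hypothetical failure handler for virtualized workers: instead of
// blacklisting a flaky node, destroy the VM and ask for a fresh one.
// CloudProvider is a stand-in for a real infrastructure API.
public class VmFailureHandler {
    public interface CloudProvider {
        void terminate(String vmId);
        String provisionWorker();       // returns the new VM's id
    }

    private final CloudProvider cloud;

    public VmFailureHandler(CloudProvider cloud) { this.cloud = cloud; }

    /** Replace a failed VM; notifying the ops team is a separate concern. */
    public String handleFailure(String vmId) {
        cloud.terminate(vmId);          // no blacklist: just kill it...
        return cloud.provisionWorker(); // ...and bring up a replacement
    }
}
```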
-steve