Edward Capriolo wrote:

I know there is a Jira open to add life cycle methods to each hadoop
component that can be polled for progress. I don't know the number offhand.


HDFS-326 (https://issues.apache.org/jira/browse/HDFS-326); the code has its own branch.

This is still something I'm working on. The code works and all the tests pass, but there are some quirks I'm not happy with now that JobTracker startup blocks waiting for the filesystem to come up; I need to add some new tests/mechanisms to shut down a service while it is still starting up, which includes interrupting the JT and TT.

You can get RPMs with all this stuff packaged up for use from http://smartfrog.org/ , with the caveat that it's still fairly unstable.

I am currently working on the other side of the equation, integration with multiple cloud infrastructures, with all the fun testing issues that follow:
http://www.1060.org/blogxter/entry?publicid=12CE2B62F71239349F3E9903EAE9D1F0


* The simplest liveness test for any of the workers right now is to hit their HTTP pages; it's the classic "happy" test. We can and should extend this with more self-tests, some equivalent of Axis's happy.jsp. The nice thing about these is that they integrate well with all the existing web-page monitoring tools, though I should warn that the tooling which tracks and reports the health of a four-way app server doesn't really scale to keeping an eye on 3000 task trackers. It's not the monitoring that's the problem, it's the reporting.
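As a sketch of what that kind of probe looks like, here's a minimal standalone check in Java. It treats any HTTP 200 from the worker's status page as "alive"; the /status path and the embedded test server are illustrative stand-ins for a real worker's page, not actual Hadoop URLs:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class HappyProbe {

    /** Return true iff a GET on the worker's status page yields HTTP 200. */
    static boolean isAlive(String statusUrl, int timeoutMs) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(statusUrl).openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.setRequestMethod("GET");
            return conn.getResponseCode() == 200;
        } catch (IOException e) {
            return false;   // unreachable counts as unhealthy
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny local server standing in for a worker's HTTP status page.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/status", exchange -> {
            byte[] body = "OK".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();

        String url = "http://127.0.0.1:"
                + server.getAddress().getPort() + "/status";
        System.out.println("alive=" + isAlive(url, 2000));
        server.stop(0);
    }
}
```

A deeper "happy" page would run real self-tests behind that URL rather than just serving static content, but the monitoring side stays the same: poll, check the status code, alert on failure.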

* Detecting failures of TTs and DNs is kind of tricky too; it's really the namenode and jobtracker that know best. We need to get some reporting in there so that when either of the masters thinks one of its workers is playing up, it reports that to whatever plugin wants to know.
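One way that reporting hook could look, purely as an illustration (the listener interface, method names, and toy master below are invented for this sketch, not existing Hadoop APIs):

```java
import java.util.ArrayList;
import java.util.List;

public class HealthReporting {

    /** Hypothetical callback a monitoring plugin registers with a master. */
    interface NodeHealthListener {
        void workerUnhealthy(String workerId, String reason);
    }

    /** Toy stand-in for the namenode/jobtracker side of the reporting. */
    static class Master {
        private final List<NodeHealthListener> listeners = new ArrayList<>();

        void register(NodeHealthListener l) {
            listeners.add(l);
        }

        /** Called when the master decides a worker is playing up. */
        void flagWorker(String workerId, String reason) {
            for (NodeHealthListener l : listeners) {
                l.workerUnhealthy(workerId, reason);
            }
        }
    }

    public static void main(String[] args) {
        Master jobTracker = new Master();
        // A plugin that just logs; a real one might page ops or kill a VM.
        jobTracker.register((id, why) ->
                System.out.println("ALERT " + id + ": " + why));
        jobTracker.flagWorker("tt-042", "missed 3 heartbeats");
    }
}
```

The point of the indirection is that the masters keep their existing failure-detection logic and just publish verdicts; what happens next (blacklist, page someone, restart a VM) becomes the plugin's business.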

* Handling failures of VMs is very different from handling failures of physical machines: you just kill the VM and start a new one. We don't need all the blacklisting machinery, just some infrastructure operations and notifications to the ops team.

-steve




