Edward Capriolo wrote:
> I know there is a JIRA open to add lifecycle methods to each Hadoop
> component that can be polled for progress. I don't know the # off hand.
That would be HDFS-326 (https://issues.apache.org/jira/browse/HDFS-326);
the code has its own branch.
This is still something I'm working on. The code works and all the tests
pass, but there are some quirks I'm not happy with: the JobTracker now
blocks at startup waiting for the filesystem to come up, so I need to add
some new tests and mechanisms for shutting down a service while it is
still starting up, which includes being able to interrupt the JobTracker
and TaskTracker.
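A minimal sketch of the kind of mechanism that needs, assuming a
hypothetical `InterruptibleService` class (none of these names come from
the actual HDFS-326 branch): stopping a service that is still blocked in
startup means interrupting the thread doing the startup work.

```java
// Hypothetical sketch only: a service whose stop() can interrupt a
// startup that is blocked waiting on a dependency (e.g. the filesystem).
// Class and method names are illustrative, not from the HDFS-326 branch.
public class InterruptibleService {
    public enum State { CREATED, STARTING, LIVE, STOPPED }

    private volatile State state = State.CREATED;
    private volatile Thread startupThread;

    /** Runs in the caller's thread; may block for a long time. */
    public void start() {
        state = State.STARTING;
        startupThread = Thread.currentThread();
        try {
            waitForFilesystem();       // stand-in for the blocking work
            state = State.LIVE;
        } catch (InterruptedException e) {
            state = State.STOPPED;     // interrupted mid-startup: give up cleanly
        }
    }

    /** Safe to call at any point in the lifecycle, including mid-startup. */
    public void stop() {
        Thread t = startupThread;
        if (state == State.STARTING && t != null) {
            t.interrupt();             // unblock the stuck startup thread
        } else {
            state = State.STOPPED;
        }
    }

    public State getState() { return state; }

    private void waitForFilesystem() throws InterruptedException {
        Thread.sleep(60_000);          // pretend the filesystem is slow to come up
    }
}
```

The point of the sketch is that stop() has to be legal in every state,
not just LIVE, which is exactly what the current JobTracker startup path
makes awkward.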
You can get RPMs with all this stuff packaged up from
http://smartfrog.org/ , with the caveat that it's still fairly unstable.
I'm currently working on the other side of the equation, integration with
multiple cloud infrastructures, with all the fun testing issues that follow:
http://www.1060.org/blogxter/entry?publicid=12CE2B62F71239349F3E9903EAE9D1F0
* The simplest liveness test for any of the workers right now is to hit
their HTTP pages; it's the classic "happy page" test. We can and should
extend this with more self-tests, some equivalent of Axis's happy.jsp.
The nice thing about these is that they integrate well with all the
existing web-page monitoring tools, though I should warn that the same
tooling that tracks and reports the health of a four-way app server
doesn't really scale to keeping an eye on 3000 task trackers. It's not
the monitoring that breaks down, it's the reporting.
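As a sketch of that probe, this is all a "happy page" check amounts to;
the host, port, and page path below are assumptions to be filled in from
your cluster config, not values the patch defines:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal "happy page" liveness probe: a worker counts as live if its
// embedded web server answers 200 on a status page. Host/port/page are
// assumptions; use whatever your cluster config says.
public class HappyCheck {
    public static boolean isHappy(String host, int port, String page) {
        try {
            URL url = new URL("http", host, port, page);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(2000);   // don't hang the monitor on a dead box
            conn.setReadTimeout(2000);
            conn.setRequestMethod("GET");
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;                   // unreachable or timed out: not happy
        }
    }
}
```

Anything that can issue an HTTP GET and check the status code can run
this, which is why it plugs straight into existing web monitoring tools.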
* Detecting failures of TaskTrackers and DataNodes is tricky too; it's
really the NameNode and JobTracker that know best. We need to get some
reporting in there so that when either of the masters thinks one of its
workers is playing up, it reports that to whatever plugin wants to know.
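That reporting hook could be as simple as a callback interface that a
monitoring plugin registers with the master. The sketch below is purely
illustrative; none of these names exist in the codebase today:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative only: a listener a monitoring plugin could register with
// the NameNode/JobTracker to hear about workers the master suspects.
public class WorkerHealth {
    public interface Listener {
        void workerSuspect(String workerId, String reason);
    }

    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    public void addListener(Listener l) { listeners.add(l); }

    /** Called by the master when a heartbeat is missed, tasks keep failing, etc. */
    public void reportSuspect(String workerId, String reason) {
        for (Listener l : listeners) {
            l.workerSuspect(workerId, reason);
        }
    }
}
```

The master already has the heartbeat and task-failure data; the only new
piece is the fan-out to whoever registered an interest.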
* Handling failures of VMs is very different from handling failures of
physical machines: you just kill the VM and start a new one. We don't
need all the blacklisting machinery, just some infrastructure operations
and various notifications to the ops team.
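In code terms the VM policy is just "terminate and reprovision" instead
of "blacklist". The `CloudProvider` interface here is hypothetical, a
stand-in for whichever infrastructure API is actually in use:

```java
// Hypothetical failure handler for virtualized workers: instead of
// blacklisting a flaky node, destroy the VM and ask for a fresh one.
// CloudProvider is a stand-in for a real infrastructure API.
public class VmFailureHandler {
    public interface CloudProvider {
        void terminate(String vmId);
        String provisionWorker();       // returns the new VM's id
    }

    private final CloudProvider cloud;

    public VmFailureHandler(CloudProvider cloud) { this.cloud = cloud; }

    /** Replace a failed VM; notifying the ops team is a separate concern. */
    public String handleFailure(String vmId) {
        cloud.terminate(vmId);          // no blacklist: just kill it...
        return cloud.provisionWorker(); // ...and bring up a replacement
    }
}
```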
-steve