Quick question for the hadoop / linux masters out there:

I recently observed a stalled tasktracker daemon on our production cluster,
and am wondering whether there are common health checks for detecting this
kind of failure so that administration tools (e.g. monit) can automatically
restart the daemon.  The particular symptoms observed were:

   - the node was dropped by the jobtracker
   - information in /proc listed the tasktracker process as sleeping, not
   zombie
   - the web interface (port 50060) was unresponsive, though telnet did
   connect
   - no error information in the hadoop logs -- they simply were no longer
   being updated
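
For illustration, here is a rough sketch of a probe built around the last two
symptoms. Since the port still accepted TCP connections, a plain port check
would not have caught the stall, so the sketch insists on a complete HTTP
response within a timeout and optionally checks that a log file was modified
recently. The function names, the timeout, the 10-minute staleness threshold
and the log path argument are all assumptions made for the example, not
anything Hadoop provides, and an idle node may legitimately log very little.

```python
import http.client
import os
import time


def http_responds(host, port, timeout=10.0):
    """Return True only if a complete HTTP response is read back.

    `timeout` applies to each socket operation, so a daemon that accepts
    the connection but never answers is reported as unresponsive.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        conn.getresponse().read()
        return True
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts and malformed replies.
        return False
    finally:
        conn.close()


def log_is_fresh(path, max_age_seconds=600.0):
    """Return True if `path` exists and was modified recently enough."""
    try:
        return time.time() - os.path.getmtime(path) <= max_age_seconds
    except OSError:
        # A missing or unreadable log is treated as stale.
        return False


def tasktracker_healthy(host="localhost", port=50060, log_path=None,
                        timeout=10.0, max_age_seconds=600.0):
    """Combine both checks; the log check is skipped if no path is given.

    Port 50060 is the web interface port mentioned above; everything else
    here is a hypothetical default.
    """
    if not http_responds(host, port, timeout=timeout):
        return False
    if log_path is not None and not log_is_fresh(log_path, max_age_seconds):
        return False
    return True
```

A small wrapper that exits non-zero when tasktracker_healthy() returns False
could then be run from cron or a monitoring tool to trigger the restart.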

I certainly cannot be the first person to encounter this - anyone have a
neat and tidy solution they could share?

(And yes, we will eventually go down the nagios / ganglia / cloudera
desktop path but we're waiting until we're running CDH2.)

Many thanks,
-James Warren
