Hi James,

This doesn't quite answer your original question, but if you want to help track down these kinds of bugs, you should grab a stack trace the next time this happens.
You can do this with "jstack" from the command line, by visiting /stacks on the HTTP interface, or by sending the process a SIGQUIT (kill -QUIT <pid>). If you go the SIGQUIT route, the stack dump will show up in that daemon's stdout log (logs/hadoop-....out).

Oftentimes the stack trace will be enough for the developers to track down a deadlock, or it may point to some sort of configuration issue on your machine.

-Todd

On Wed, Oct 7, 2009 at 11:19 PM, james warren <[email protected]> wrote:
> Quick question for the hadoop / linux masters out there:
>
> I recently observed a stalled tasktracker daemon on our production cluster,
> and was wondering if there were common tests to detect failures so that
> administration tools (e.g. monit) can automatically restart the daemon. The
> particular observed symptoms were:
>
> - the node was dropped by the jobtracker
> - information in /proc listed the tasktracker process as sleeping, not zombie
> - the web interface (port 50060) was unresponsive, though telnet did connect
> - no error information in the hadoop logs -- they simply were no longer
>   being updated
>
> I certainly cannot be the first person to encounter this - anyone have a
> neat and tidy solution they could share?
>
> (And yes, we will eventually go down the nagios / ganglia / cloudera
> desktop path but we're waiting until we're running CDH2.)
>
> Many thanks,
> -James Warren
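P.S. The three approaches above can be sketched as a small shell snippet. This is only a sketch under a few assumptions: the pgrep pattern and the port (50060, the default tasktracker web UI port) may differ on your cluster, and the `check_tt` helper is a hypothetical name, not part of Hadoop.

```shell
#!/bin/sh
# Sketch: three ways to grab a stack trace from a hung TaskTracker.
# The pgrep pattern is an assumption; adjust it for your install.
PID=$(pgrep -f TaskTracker | head -n 1)

# 1. jstack (ships with the JDK) prints the trace to your terminal:
#      jstack "$PID" > /tmp/tasktracker-stack.txt

# 2. SIGQUIT makes the JVM dump its threads to the daemon's own
#    stdout log (logs/hadoop-....out), not to your terminal:
#      kill -QUIT "$PID"

# 3. The HTTP interface serves the same dump at /stacks. This also
#    doubles as a liveness probe for a tool like monit: a hung daemon
#    may keep its listening socket open (so telnet connects) yet never
#    answer, so an HTTP fetch with a timeout is the better health signal.
check_tt() {
    # -f: fail on HTTP errors; -m 10: give up after 10 seconds
    curl -sf -m 10 "http://${1:-localhost}:50060/stacks" > /dev/null
}

check_tt localhost || echo "TaskTracker web UI not responding"
```

The key point for automated restarts is the timeout in approach 3: checking only that the port accepts a TCP connection would have missed the hang you describe.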
