Hi James,
This doesn't quite answer your original question, but if you want to help
track down these kinds of bugs, you should grab a stack trace next time this
happens.

You can do this one of three ways: using "jstack" from the command line, by
visiting /stacks on the daemon's HTTP interface, or by sending the process a
SIGQUIT (kill -QUIT <pid>). If you go the SIGQUIT route, the stack dump will
show up in that daemon's stdout log (logs/hadoop-....out).
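Something like the sketch below covers all three. Note the pid-file path, the
50060 port, and the helper name are my assumptions based on the old defaults;
adjust them for your install:

```shell
#!/bin/sh
# Sketch: three ways to grab a stack trace from a stalled TaskTracker.
# Assumes jstack (from the JDK) and curl are on the PATH.

dump_stacks() {
  pid=$1
  # Option 1: jstack prints the thread dump to stdout; save it to a file.
  jstack "$pid" > "tasktracker-stack-$(date +%s).txt"
  # Option 2: the /stacks servlet on the daemon's web port (50060 for a
  # TaskTracker by default).
  curl -s "http://localhost:50060/stacks"
  # Option 3: SIGQUIT; the JVM writes the dump to the daemon's .out log
  # rather than killing the process.
  kill -QUIT "$pid"
}

# Usage (pid-file location is the old default and may differ):
#   dump_stacks "$(cat /tmp/hadoop-$USER-tasktracker.pid)"
```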

Oftentimes the stack trace will be enough for the developers to track down a
deadlock, or it may point to some sort of configuration issue on your
machine.

-Todd


On Wed, Oct 7, 2009 at 11:19 PM, james warren <[email protected]> wrote:

> Quick question for the hadoop / linux masters out there:
>
> I recently observed a stalled tasktracker daemon on our production cluster,
> and was wondering if there were common tests to detect failures so that
> administration tools (e.g. monit) can automatically restart the daemon. The
> particular observed symptoms were:
>
>   - the node was dropped by the jobtracker
>   - information in /proc listed the tasktracker process as sleeping, not
>   zombie
>   - the web interface (port 50060) was unresponsive, though telnet did
>   connect
>   - no error information in the hadoop logs -- they simply were no longer
>   being updated
>
> I certainly cannot be the first person to encounter this - anyone have a
> neat and tidy solution they could share?
>
> (And yes, we will eventually go down the nagios / ganglia / cloudera
> desktop path but we're waiting until we're running CDH2.)
>
> Many thanks,
> -James Warren
>
