On Jul 12, 2011, at 4:34 PM, <samdispmail-tru...@yahoo.com> wrote:
> I am working on deploying Hadoop on a small cluster. For now, I am interested
> in restarting (restart the node or even reboot the OS) the nodes Hadoop
> detects as crashed.

        There are quite a few scenarios where one service may be up but another
may be down, so monitoring per-service is usually the better way to go.

> "Instead, one should monitor the namenode and jobtracker and alert based on a 
> percentage of availability.  ... "
> Indeed.
> I use Hadoop 0.20.203.

        OK, then that means...

> 
> "This can be done in a variety of ways, ..."
> Can you please provide any pointers.

        ... you're pretty much required to use JMX to query the NN and JT to
get node information, since the rest of the APIs weren't forward-ported as
promised (Ganglia is out of the equation anyway).  Luckily, it is fairly
trivial to set up a Nagios script to poll that information over JMX, and in
our experience that information actually works; some of the metrics2 stuff
doesn't appear to be working properly on the DN and TT.  A rough sketch of
such a check is below.
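
        For what it's worth, a bare-bones Nagios-style availability check in
Java can be as simple as opening a JMX connection to the NN and exiting with
the usual Nagios codes (0 = OK, 2 = CRITICAL).  The host name and JMX port
below are placeholders; they assume you've turned on remote JMX for the NN
via the com.sun.management.jmxremote.* options in hadoop-env.sh.

    // CheckNameNodeJmx.java - rough sketch, not battle-tested.
    // Assumes remote JMX is enabled on the NN; host/port are placeholders.
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CheckNameNodeJmx {
        public static void main(String[] args) {
            String host = args.length > 0 ? args[0] : "namenode.example.com";
            String port = args.length > 1 ? args[1] : "8004";
            String url = "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
            try {
                JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(url));
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                conn.getMBeanCount();   // any cheap call proves the NN answers
                jmxc.close();
                System.out.println("OK - NameNode JMX is responding");
                System.exit(0);         // Nagios: OK
            } catch (Exception e) {
                System.out.println("CRITICAL - cannot reach NameNode JMX: " + e.getMessage());
                System.exit(2);         // Nagios: CRITICAL
            }
        }
    }

Wire that up as a check_command and Nagios will handle the alerting and the
availability math for you.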

> Do you know how I can access the monitoring information of the namenode or
> the jobtracker so I can extract a list of failed nodes?

        Take a look at the DeadNodes and LiveNodes attributes in the NameNode
and JobTracker sections of the Hadoop MBeans.  That's likely your best bet.
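
        For instance, something along these lines pulls those attributes over
JMX.  The object name used here, "Hadoop:service=NameNode,name=NameNodeInfo",
is an assumption based on what we see on our build; verify it with jconsole
against your own 0.20.203 NN, since the domain and key names have moved around
between releases.  Host and port are placeholders again.

    // ListDeadNodes.java - rough sketch: dump LiveNodes/DeadNodes off the NN.
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ListDeadNodes {
        public static void main(String[] args) throws Exception {
            String url = "service:jmx:rmi:///jndi/rmi://namenode.example.com:8004/jmxrmi";
            JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(url));
            try {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                // Object name is an assumption; confirm it in jconsole first.
                ObjectName nn = new ObjectName("Hadoop:service=NameNode,name=NameNodeInfo");
                // On our build both attributes come back as JSON-style strings
                // keyed by hostname; cast accordingly on yours.
                String dead = (String) conn.getAttribute(nn, "DeadNodes");
                String live = (String) conn.getAttribute(nn, "LiveNodes");
                System.out.println("DeadNodes: " + dead);
                System.out.println("LiveNodes: " + live);
            } finally {
                jmxc.close();
            }
        }
    }

The JT side works the same way; poke around its MBeans in jconsole to see
which attribute carries the tracker list on your version.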

> 
> Why I thought of using metrics information is because they are periodic and
> seemed easy to access. I thought of using them as heart beats only (i.e. if I
> do not receive the metric in 2-3 periods I reset the node).

        You end up essentially doing the same thing that the NN and JT are
already doing... so you might as well just ask them rather than doing it again
and generating even more network traffic than necessary.  Additionally, there
are some failures where the NN or JT may view a service daemon as down even
though it still responds to other queries (from thread death or lock-up).  For
example, we've got a job that has on occasion tripped up the 0.20.2 DN with
OOM issues.  The process lies in a pseudo-dead state due to some weird
exception handling down in the bowels of the code.  The NN rightfully declares
it dead, but depending upon how you ask the node itself, it may respond!

        So be careful out there.
