If you want to be quick and dirty about it on a 5-node test cluster, you could:

    # -s: quiet, -f: treat HTTP errors as failure
    if curl -sf namenode:50070 > /dev/null; then
      echo "YAY it's up"
    else
      echo "It's down"
      restartNamenodeCommand
    fi
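If you wanted to actually wire that up, a crontab entry along these lines would do (the script path and log location are placeholders, obviously):

    # run the check every 5 minutes and keep a log of what it did
    */5 * * * * /opt/scripts/check_namenode.sh >> /var/log/check_namenode.log 2>&1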
But if you have a proper cluster, you're better off using Nagios or something similar.

If you want down/up information in Nagios:

    service_description   NameNode
    check_command         check_http!50070

    service_description   JobTracker
    check_command         check_http!50030

    service_description   TaskTracker
    check_command         check_http!50060

    service_description   Secondary NameNode
    check_command         check_http!50090

    service_description   DataNode
    check_command         check_http!50075

If you want metrics for thresholds in Nagios: modify hadoop-metrics.properties to expose the /metrics URL (sketched at the end of this message) and run something like:

http://exchange.nagios.org/directory/Plugins/Others/check_hadoop_metrics/details

Similar to what Allen suggested, we also have a script that scrapes the NameNode and JobTracker pages, gets the number of nodes reporting, and alerts if we fall below a threshold (also sketched at the end of this message).

- Nathan Milford

On Tue, Jul 12, 2011 at 7:34 PM, <samdispmail-tru...@yahoo.com> wrote:

> Thank you very much Allen,
>
> "common-user@ would likely have been better, but I'm too lazy to forward
> you there today. :)"
> Thank you :-)
>
> "Do you want monitoring information or metrics information?"
> I need monitoring information.
> I am working on deploying Hadoop on a small cluster. For now, I am
> interested in restarting (restarting the node or even rebooting the OS)
> the nodes Hadoop detects as crashed.
>
> "Instead, one should monitor the namenode and jobtracker and alert based
> on a percentage of availability. ..."
> Indeed.
> I use Hadoop 0.20.203.
>
> "This can be done in a variety of ways, ..."
> Can you please provide any pointers?
> Do you know how I can access the monitoring information of the namenode
> or the jobtracker so I can extract a list of failed nodes?
>
> Thank you very much for your help.
>
> P.S.:
> Why I thought of using metrics information: they are periodic and seemed
> easy to access. I thought of using them as heartbeats only (i.e. if I do
> not receive the metric in 2-3 periods, I reset the node).
>
> Thank you
>
> -sam
>
> ------------------------------
> *From:* Allen Wittenauer <a...@apache.org>
> *To:* mapreduce-user@hadoop.apache.org
> *Sent:* Tue, July 12, 2011 3:13:42 PM
> *Subject:* Re: How to query a slave node for monitoring information
>
> On Jul 12, 2011, at 3:02 PM, <samdispmail-tru...@yahoo.com> wrote:
>
> > I am new to Hadoop, and I apologize if this was answered before, or if
> > this is not the right list for my question.
>
> common-user@ would likely have been better, but I'm too lazy to forward
> you there today. :)
>
> > I am trying to do the following:
> > 1- Read monitoring information from slave nodes in Hadoop
> > 2- Process the data to detect node failures (node crashes, problems in
> > requests, etc.) and decide if I need to restart the whole machine.
> > 3- Restart the machine running the slave facing problems
>
> At scale, one doesn't monitor individual nodes for up/down. Verifying
> the up/down of a given node will drive you insane and is pretty much a
> waste of time unless the grid itself is under-configured to the point
> that *every* *node* *counts*. (If that is the case, then there are
> bigger issues afoot...)
>
> Instead, one should monitor the namenode and jobtracker and alert based
> on a percentage of availability. This can be done in a variety of ways,
> depending upon which version of Hadoop is in play. For 0.20.2, a simple
> screen scrape is good enough. I recommend warn on 10%, alert on 20%,
> panic on 30%.
>
> > My question is for step 1 - collecting monitoring information.
> > I have checked Hadoop's monitoring features. But currently you can
> > forward the monitoring data to files, or to Ganglia.
>
> Do you want monitoring information or metrics information? Ganglia is
> purely a metrics tool. Metrics are a different animal. While it is
> possible to alert on them, in most cases they aren't particularly useful
> in a monitoring context other than up/down.
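P.S. Rough sketches of the two things I mentioned above. Both are untested, so treat them as starting points rather than working code.

To keep the /metrics servlet populated, the usual 0.20-era trick is to point the metrics contexts at NullContextWithUpdateThread in hadoop-metrics.properties (dfs and mapred shown here; the period is an arbitrary choice):

    # hadoop-metrics.properties -- retain metrics in memory so the
    # daemons' /metrics URL actually serves values (Hadoop 0.20.x)
    dfs.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
    dfs.period=10
    mapred.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
    mapred.period=10

And a bare-bones version of our scrape-and-threshold check. The hostname, the expected node count, the threshold, and especially the regex against dfshealth.jsp are assumptions -- check them against your own cluster's page:

    #!/bin/bash
    # Scrape the NameNode web UI and alert if live datanodes fall below
    # a percentage of what we expect. Exit codes follow Nagios conventions.
    EXPECTED=5     # datanodes the cluster should have (assumption)
    WARN_PCT=90    # warn below this percentage of EXPECTED (assumption)

    # Pull the "Live Nodes" count off dfshealth.jsp; the pattern is a
    # guess at the 0.20 page layout.
    LIVE=$(curl -s http://namenode:50070/dfshealth.jsp \
           | grep -o 'Live Nodes[^0-9]*[0-9][0-9]*' \
           | grep -o '[0-9][0-9]*$')

    if [ -z "$LIVE" ]; then
      echo "CRITICAL: could not scrape the NameNode page"
      exit 2
    fi

    PCT=$(( LIVE * 100 / EXPECTED ))
    if [ "$PCT" -lt "$WARN_PCT" ]; then
      echo "WARNING: only $LIVE of $EXPECTED datanodes reporting ($PCT%)"
      exit 1
    fi

    echo "OK: $LIVE of $EXPECTED datanodes reporting"
    exit 0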