Thank you very much Allen and Nathan for your help. I will follow your suggestions and check your pointers.
Thank you
-sam

________________________________
From: Nathan Milford <nat...@milford.io>
To: mapreduce-user@hadoop.apache.org
Sent: Tue, July 12, 2011 6:21:07 PM
Subject: Re: How to query a slave node for monitoring information

If you wanna be kinda ghetto about it on a 5 node test cluster you could:

    if curl namenode:50070; then
      echo "YAY it's up"
    else
      echo "It's down"
      restartNamenodeCommand
    fi

But, if you have a proper cluster you're better off using Nagios or something similar.

If you want down/up information in Nagios:

    service_description NameNode
    check_command check_http!50070

    service_description JobTracker
    check_command check_http!50030

    service_description TaskTracker
    check_command check_http!50060

    service_description Secondary NameNode
    check_command check_http!50090

    service_description DataNode
    check_command check_http!50075

If you want metrics for thresholds in Nagios:

Modify hadoop-metrics.properties to expose the /metrics URL and run something like:
http://exchange.nagios.org/directory/Plugins/Others/check_hadoop_metrics/details

Similar to what Allen suggested, we also have a script that scrapes the NameNode and JobTracker pages, gets the number of nodes reporting, and alerts if we fall below a threshold.

- Nathan Milford

On Tue, Jul 12, 2011 at 7:34 PM, <samdispmail-tru...@yahoo.com> wrote:

Thank you very much Allen,
>
>"common-user@ would likely have been better, but I'm too lazy to forward you there today. :)"
>Thank you :-)
>
>"Do you want monitoring information or metrics information? "
>I need monitoring information.
>I am working on deploying Hadoop on a small cluster. For now, I am interested in
>restarting (restart the node or even reboot the OS) the nodes Hadoop detects as
>crashed.
>
>"Instead, one should monitor the namenode and jobtracker and alert based on a percentage of availability. ... "
>Indeed.
>I use Hadoop 0.20.203.
>
>"This can be done in a variety of ways, ..."
>Can you please provide any pointers?
>Do you know how I can access the monitoring information of the namenode or the
>jobtracker so I can extract a list of failed nodes?
>
>Thank you very much for your help
>
>P.S.:
>The reason I thought of using metrics information is that they are periodic and
>seemed easy to access. I thought of using them as heartbeats only (i.e. if I do
>not receive the metric in 2-3 periods I reset the node).
>
>Thank you
>
>-sam
>
>________________________________
>From: Allen Wittenauer <a...@apache.org>
>To: mapreduce-user@hadoop.apache.org
>Sent: Tue, July 12, 2011 3:13:42 PM
>Subject: Re: How to query a slave node for monitoring information
>
>On Jul 12, 2011, at 3:02 PM, <samdispmail-tru...@yahoo.com> wrote:
>> I am new to Hadoop, and I apologize if this was answered before, or if this is
>> not the right list for my question.
>
>    common-user@ would likely have been better, but I'm too lazy to forward you
>there today. :)
>
>> I am trying to do the following:
>> 1- Read monitoring information from slave nodes in hadoop
>> 2- Process the data to detect node failures (node crash, problems in requests
>> ... etc) and decide if I need to restart the whole machine.
>> 3- Restart the machine running the slave facing problems
>
>    At scale, one doesn't monitor individual nodes for up/down. Verifying the
>up/down of a given node will drive you insane and is pretty much a waste of time
>unless the grid itself is under-configured to the point that *every* *node*
>*counts*. (If that is the case, then there are bigger issues afoot...)
>
>    Instead, one should monitor the namenode and jobtracker and alert based on a
>percentage of availability. This can be done in a variety of ways, depending
>upon which version of Hadoop is in play. For 0.20.2, a simple screen scrape is
>good enough. I recommend warn on 10%, alert on 20%, panic on 30%.
>
>> My question is for step 1- collecting monitoring information.
>> I have checked Hadoop monitoring features. But currently you can forward the
>> monitoring data to files, or to Ganglia.
>
>    Do you want monitoring information or metrics information? Ganglia is
>purely a metrics tool. Metrics are a different animal. While it is possible
>to alert on them, in most cases they aren't particularly useful in a monitoring
>context other than up/down.
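
Editor's sketch: the percentage-based alerting recommended in this thread (warn on 10%, alert on 20%, panic on 30%) could look roughly like the Python below. It assumes the percentages refer to the share of nodes that are not reporting, and that the dead and total node counts have already been obtained some other way (for example by scraping the NameNode or JobTracker page, which is not shown here). The function name `availability_level` and its parameters are hypothetical and are not part of Hadoop or Nagios.

```python
def availability_level(dead_nodes, total_nodes,
                       warn_pct=10, alert_pct=20, panic_pct=30):
    """Map the share of dead nodes to a severity level.

    Assumption: the thresholds are percentages of nodes that are down.
    Integer arithmetic is used so that exact boundary values (e.g. 10%)
    are not affected by floating-point rounding.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= dead_nodes <= total_nodes:
        raise ValueError("dead_nodes must be between 0 and total_nodes")

    # Compare dead/total against pct/100 without dividing.
    scaled_dead = dead_nodes * 100
    if scaled_dead >= panic_pct * total_nodes:
        return "PANIC"
    if scaled_dead >= alert_pct * total_nodes:
        return "ALERT"
    if scaled_dead >= warn_pct * total_nodes:
        return "WARN"
    return "OK"


if __name__ == "__main__":
    # Hypothetical counts for a 40-node cluster.
    for dead in (0, 4, 8, 12):
        print(dead, availability_level(dead, 40))
```

On a 5 node test cluster a single dead node is already 20%, so these thresholds only become meaningful on larger grids, which matches the point made above about monitoring at scale.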