I really like method 3. I am currently screen-scraping the jobtracker JSP page, but I see that as only a partial solution: the format of the page could change at any moment, and it is potentially much more computationally intensive, depending on how much information I want to extract. One thing I thought of would be to create a custom 'naked' JSP that has very little formatting.
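For what it's worth, here is a rough sketch of how the method 3 probe could look as a Nagios-style check. It only does a GET against a status URL and maps the result to the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The names probe() and main() are my own, the URL is whatever host and port your datanode or jobtracker web server actually listens on, and the no-cache request headers are only a client-side hedge against the caching problem mentioned below; I have not checked what the server-side pages do.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Bypass any configured HTTP proxy so the probe talks to the node directly
# (assumption: a health check should not go through a caching proxy).
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=5.0):
    """GET `url` and return a (nagios_code, message) tuple."""
    try:
        request = urllib.request.Request(
            url,
            # Ask intermediaries not to serve a cached "happy page".
            headers={"Cache-Control": "no-cache", "Pragma": "no-cache"},
        )
        with _opener.open(request, timeout=timeout) as response:
            status = response.status
    except ValueError as err:
        # Malformed URL: the check itself is misconfigured.
        return UNKNOWN, "UNKNOWN - bad URL %r: %s" % (url, err)
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return CRITICAL, "CRITICAL - %s returned HTTP %d" % (url, err.code)
    except OSError as err:
        # Connection refused, DNS failure, timeout, and similar.
        return CRITICAL, "CRITICAL - %s unreachable: %s" % (url, err)
    if status == 200:
        return OK, "OK - %s returned HTTP 200" % url
    return WARNING, "WARNING - %s returned HTTP %d" % (url, status)


def main(argv):
    """Run the probe for a single URL argument; return the exit code."""
    if len(argv) != 1:
        print("usage: check_hadoop_http.py URL")
        return UNKNOWN
    code, message = probe(argv[0])
    print(message)
    return code
```

To use it as a plugin, the script would end with sys.exit(main(sys.argv[1:])) (plus an import sys) and be pointed at the web port of the node to check. This only tests that the page responds; it does not parse the page, so it sidesteps the format-change problem but also tells you less than scraping would.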
On Wed, Jul 2, 2008 at 6:19 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> Meng Mao wrote:
>
>> For a Nagios script I'm writing, I'd like a command-line method that
>> checks if HDFS is up and running.
>> Is there a better way than to attempt a hadoop dfs command and check the
>> error code?
>
> 1. There is JMX support built in to Hadoop. If you can bring up Hadoop
> running a JMX agent that is compatible with Nagios, you can keep a close
> eye on the internals.
>
> 2. I'm making some lifecycle changes to Hadoop; if/when accepted every
> service (name, data, job,...) will have an internal ping() operation to
> check their health -this can be checked in-process only. I'm also adding
> the smartfrog support to do that in-process pinging, fallback etc; I don't
> know how nagios would work there, but JMX support for these ops should
> also be possible.
>
> 3. When a datanode comes up it starts jetty on a specific port -you can do
> a GET against that jetty instance to see if it is responding. This is a
> good test as it really does verify that the service is live and
> responding. Indeed, that is the official definition of "liveness", at
> least according to Lamport.
>  * review the code to make sure it turns caching off, or you can be burned
>    probing for health long haul, seeing the happy page and thinking all is
>    well. I forgot to do that in happyaxis.jsp, which is why axis 1.x
>    health checks don't work long-haul.
>  * I could imagine improving those pages with better ones, like something
>    that checks that the available freespace is within a certain range, and
>    returns an error code if there is less, e.g.
>    http://datanode7:5000/checkDiskSpace?mingb=1500
>    would test for a min disk space of 1500GB.
>
> There are also web pages for job trackers & the like; better for remote
> health checking than jps checks. JPS (and killall) is better for fallback
> when the things stop responding, but not adequate for liveness checks.

--
hustlin, hustlin, everyday I'm hustlin