Re: correct pattern for using setOutputValueGroupingComparator?

2009-01-06 Thread Meng Mao
>
> No, that's not possible without changing your app to generate that many
> records. So for example, in your map, you could output multiple records
> corresponding to the wild-card records.
>
Would it be sufficient to 'spam' more records than the expected number of
groups, and then handle the extra records that some groups would end up with?
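
Concretely, something like this hypothetical mapper is what I had in mind --
just a sketch against the 0.18-style org.apache.hadoop.mapred API, and it
assumes the set of possible group markers (SECONDARY_MARKERS below) is small
and known up front, which may be exactly the part that doesn't hold:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WildcardMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Hypothetical: the markers we expect secondary-key groups to use.
  private static final String[] SECONDARY_MARKERS = {"A", "B", "C"};

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\\s+", 3);
    if (fields.length < 3) {
      return; // skip malformed lines
    }
    String id = fields[0], marker = fields[1], rest = fields[2];
    if ("-".equals(marker)) {
      // Wild-card metadata record: emit one copy per group it might belong to.
      for (String m : SECONDARY_MARKERS) {
        output.collect(new Text(id + "\t" + m), new Text(rest));
      }
    } else {
      output.collect(new Text(id + "\t" + marker), new Text(rest));
    }
  }
}

The reducer would then have to tolerate groups that contain only the spammed
metadata copy and no real records.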


Re: correct pattern for using setOutputValueGroupingComparator?

2009-01-05 Thread Meng Mao
Unfortunately, my team is on 0.15 :(. We are looking to upgrade to 0.18 as
soon as we upgrade our hardware (long story).
From comparing the 0.15 and 0.19 mapreduce tutorials, and looking at the
4545 patch, I don't see anything that seems majorly different about the
MapReduce API?
- There's a Partitioner that's used, but that seems optional?
- I see that 0.19 still provides setOutputValueGroupingComparator; is the
setGroupingComparatorClass in the patch from the 0.20 API?

I have an associated question -- is it possible to use this
GroupingComparator technique to perform essentially a one-to-many mapping?
Let's say I have records like so:
id_1  -   metadata
id_2  -   metadata
id_1  A  numbers
id_2  B  numbers
id_1  C  numbers

Would it be possible for a key,value pair of <"id_1, -", metadata> to map
to both the group for key "id_1, A" and the group for "id_1, C"? The
comparator seems easy enough to write, but I don't see how a single record
could get sent to multiple groups. I know it's a bit unusual, but this kind
of wildcard behavior would be useful for us.

Meng



On Mon, Jan 5, 2009 at 6:58 PM, Owen O'Malley  wrote:

> This is exactly what the setOutputValueGroupingComparator is for. Take a
> look at HADOOP-4545, for an example using the secondary sort. If you are
> using trunk or 0.20, look at
> src/examples/org/apache/hadoop/examples/SecondarySort.java. The checked in
> example uses the new map/reduce api that was introduced in 0.20.
>
> -- Owen
>


correct pattern for using setOutputValueGroupingComparator?

2009-01-05 Thread Meng Mao
I'm trying to use map reduce to merge two classes of files, each class
using the same keys for grouping. An example:
class 1 input file:
id_1 A metadatum
id_2 A metadatum
id_1 A metadatum

class 2 input file:
id_1 B some numbers
id_1 B some numbers
id_2 B some numbers

I map using the first token, an id string, as the key. Ideally, the
intermediate input to the reducer class would be this (for the key id_1):
id_1 A metadatum
id_1 A metadatum
id_1 B some numbers
id_1 B some numbers

But because there's no guarantee on sorting for the values, we can see:
id_1 B some numbers
id_1 A metadatum
id_1 B some numbers
id_1 A metadatum


I was wondering if I could use setOutputValueGroupingComparator to force
records of the first class to sort to the top. I'm having a hard time
interpreting the documentation though:
If equivalence rules for grouping the intermediate keys are required to be
different from those for grouping keys before reduction, then one may
specify a Comparator via
JobConf.setOutputValueGroupingComparator(Class).
Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how
intermediate keys are grouped, these can be used in conjunction to simulate
*secondary sort on values*.

My interpretation is as follows:
--
class 1 input file:
id_1 A metadatum
id_1 A metadatum

class 2 input file:
id_1 B some numbers
id_2 B some numbers

Map with key = first column + delimiter + second column. Supply
setOutputKeyComparatorClass such that it only compares based on the first
half of the key. Supply setOutputValueGroupingComparator such that it only
compares based on the second half of the key. Thus, all keys like id_1* go
to the same group, and then it is sorted within that group with As first,
and then Bs (or reverse if needed).
--

Am I vastly overthinking how setOutputValueGroupingComparator works? I can't
tell from the docs whether it is possible to peek at the values associated
with the pair of keys in each comparison. If it is, I probably wouldn't need
the composite key from my interpretation above.
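
In case it helps make the question concrete, here is a rough sketch of the
setup I'm imagining (0.18-style org.apache.hadoop.mapred API; the class names
are mine, and I'm assuming the arrangement I think the SecondarySort example
uses, i.e. the sort comparator looks at the whole composite key while the
grouping comparator and partitioner only look at the id half):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MergeJobSketch {

  // Composite map output key: "id" + '\t' + class marker ("A" or "B").

  /** Sorts on the whole composite key: by id first, then A before B. */
  public static class FullKeyComparator extends WritableComparator {
    public FullKeyComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return a.compareTo(b); // Text already orders "id\tA..." before "id\tB..."
    }
  }

  /** Groups on the id half only, so all keys for one id reach a single reduce() call. */
  public static class IdGroupingComparator extends WritableComparator {
    public IdGroupingComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      String idA = a.toString().split("\t", 2)[0];
      String idB = b.toString().split("\t", 2)[0];
      return idA.compareTo(idB);
    }
  }

  /** Partitions on the id half only, so both classes of records land on the same reducer. */
  public static class IdPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, Text value, int numPartitions) {
      String id = key.toString().split("\t", 2)[0];
      return (id.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /** Wires the three pieces into a job. */
  public static void configure(JobConf conf) {
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyComparatorClass(FullKeyComparator.class);
    conf.setOutputValueGroupingComparator(IdGroupingComparator.class);
    conf.setPartitionerClass(IdPartitioner.class);
  }
}

With that arrangement the reducer for id_1 should see the A records before
the B records within a single reduce call, which is the ordering I'm after.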


Re: having different HADOOP_HOME for master and slaves?

2008-08-05 Thread Meng Mao
Is there any way for me to log and find out why the NameNode process is not
launching on the master?

On Mon, Aug 4, 2008 at 8:19 PM, Meng Mao <[EMAIL PROTECTED]> wrote:

> assumption -- if I run stop-all.sh _successfully_ on a Hadoop deployment
> (which means every node in the grid is using the same path to Hadoop), then
> that Hadoop installation becomes invisible, and then any other Hadoop
> deployment could start up and take its place on the grid. Let me know if
> this assumption is wrong.
>
> I was having a lot of grief trying to do a parallel, better permissioned
> Hadoop install the easy way, so I just went ahead and make copies on each
> node into the /new/dir location, and pointed hdfs.tmp.dir appropriately.
>
> So in a normal start-all.sh sequence, we have the following processes
> spawned:
> - master has NameNode, 2ndyNameNode, and JobTracker
> - worker has DataNode and TaskTracker
>
> After I powered down the normal Hadoop installation. I tried to
> start-all.sh mine. Again, everything with this Hadoop should point its home
> to /new/dir/hadoop, unless there's some deep hidden param I didn't know
> about. The processes I got were only
> - master: 2ndyNameNode, JobTracker
> - worker: TaskTracker
>
> Another hint is the error that calling the hadoop shell gives:
> $ bin/hadoop dfs -ls /
> 08/08/04 19:25:32 INFO ipc.Client: Retrying connect to server:
> master/ip:50001. Already tried 1 time(s).
> 08/08/04 19:25:33 INFO ipc.Client: Retrying connect to server: master
> /ip:50001. Already tried 2 time(s).
> 08/08/04 19:25:34 INFO ipc.Client: Retrying connect to server:
> master/ip:50001. Already tried 3 time(s).
>
> I can't for the life of me reason why the others are missing.
>
> On Mon, Aug 4, 2008 at 4:17 PM, Meng Mao <[EMAIL PROTECTED]> wrote:
>
>> I see. I think I could also modify the hadoop-env.sh in the new conf/
>> folders per datanode to point
>> to the right place for HADOOP_HOME.
>>
>>
>> On Mon, Aug 4, 2008 at 3:21 PM, Allen Wittenauer <[EMAIL PROTECTED]>wrote:
>>
>>>
>>>
>>>
>>> On 8/4/08 11:10 AM, "Meng Mao" <[EMAIL PROTECTED]> wrote:
>>> > I suppose I could, for each datanode, symlink things to point to the
>>> actual
>>> > Hadoop installation. But really, I would like the setup that is hinted
>>> as
>>> > possible by statement 1). Is there a way I could do it, or should that
>>> bit
>>> > of documentation read, "All machines in the cluster _must_ have the
>>> same
>>> > HADOOP_HOME?"
>>>
>>> If you run the -all scripts, they assume the location is the same.
>>> AFAIK, there is nothing preventing you from building your own -all
>>> scripts
>>> that point to the different location to start/stop the data nodes.
>>>
>>>
>>>
>>
>>
>> --
>> hustlin, hustlin, everyday I'm hustlin
>>
>
>
>
> --
> hustlin, hustlin, everyday I'm hustlin
>



-- 
hustlin, hustlin, everyday I'm hustlin


Re: having different HADOOP_HOME for master and slaves?

2008-08-04 Thread Meng Mao
My assumption -- if I run stop-all.sh _successfully_ on a Hadoop deployment
(which means every node in the grid is using the same path to Hadoop), then
that Hadoop installation is completely shut down, and any other Hadoop
deployment could start up and take its place on the grid. Let me know if
this assumption is wrong.

I was having a lot of grief trying to do a parallel, better-permissioned
Hadoop install the easy way, so I just went ahead and made copies on each
node into the /new/dir location, and pointed hdfs.tmp.dir appropriately.

So in a normal start-all.sh sequence, we have the following processes
spawned:
- master has NameNode, 2ndyNameNode, and JobTracker
- worker has DataNode and TaskTracker

After I powered down the normal Hadoop installation, I tried to start-all.sh
mine. Again, everything with this Hadoop should point its home to
/new/dir/hadoop, unless there's some deep hidden param I didn't know about.
The only processes I got were:
- master: 2ndyNameNode, JobTracker
- worker: TaskTracker

Another hint is the error that calling the hadoop shell gives:
$ bin/hadoop dfs -ls /
08/08/04 19:25:32 INFO ipc.Client: Retrying connect to server:
master/ip:50001. Already tried 1 time(s).
08/08/04 19:25:33 INFO ipc.Client: Retrying connect to server: master
/ip:50001. Already tried 2 time(s).
08/08/04 19:25:34 INFO ipc.Client: Retrying connect to server:
master/ip:50001. Already tried 3 time(s).

I can't for the life of me figure out why the others are missing.

On Mon, Aug 4, 2008 at 4:17 PM, Meng Mao <[EMAIL PROTECTED]> wrote:

> I see. I think I could also modify the hadoop-env.sh in the new conf/
> folders per datanode to point
> to the right place for HADOOP_HOME.
>
>
> On Mon, Aug 4, 2008 at 3:21 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
>
>>
>>
>>
>> On 8/4/08 11:10 AM, "Meng Mao" <[EMAIL PROTECTED]> wrote:
>> > I suppose I could, for each datanode, symlink things to point to the
>> actual
>> > Hadoop installation. But really, I would like the setup that is hinted
>> as
>> > possible by statement 1). Is there a way I could do it, or should that
>> bit
>> > of documentation read, "All machines in the cluster _must_ have the same
>> > HADOOP_HOME?"
>>
>> If you run the -all scripts, they assume the location is the same.
>> AFAIK, there is nothing preventing you from building your own -all scripts
>> that point to the different location to start/stop the data nodes.
>>
>>
>>
>
>
> --
> hustlin, hustlin, everyday I'm hustlin
>



-- 
hustlin, hustlin, everyday I'm hustlin


Re: having different HADOOP_HOME for master and slaves?

2008-08-04 Thread Meng Mao
I see. I think I could also modify the hadoop-env.sh in the new conf/
folders per datanode to point
to the right place for HADOOP_HOME.

On Mon, Aug 4, 2008 at 3:21 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:

>
>
>
> On 8/4/08 11:10 AM, "Meng Mao" <[EMAIL PROTECTED]> wrote:
> > I suppose I could, for each datanode, symlink things to point to the
> actual
> > Hadoop installation. But really, I would like the setup that is hinted as
> > possible by statement 1). Is there a way I could do it, or should that
> bit
> > of documentation read, "All machines in the cluster _must_ have the same
> > HADOOP_HOME?"
>
> If you run the -all scripts, they assume the location is the same.
> AFAIK, there is nothing preventing you from building your own -all scripts
> that point to the different location to start/stop the data nodes.
>
>
>


-- 
hustlin, hustlin, everyday I'm hustlin


having different HADOOP_HOME for master and slaves?

2008-08-04 Thread Meng Mao
I'm trying to set up 2 Hadoop installations on my master node, one of which
will have permissions that allow more users to run Hadoop.
But I don't really need anything different on the datanodes, so I'd like to
keep those as-is. With that switch, the HADOOP_HOME on the master will be
different from that on the datanodes.

After shutting down the old hadoop, I tried to start-all the new one, and
encountered this:
$ bin/stop-all.sh
no jobtracker to stop
node2: bash: line 0: cd: /new/dir/hadoop/bin/..: No such file or directory
node2: bash: /new/dir/hadoop/bin/hadoop-daemon.sh: No such file or directory

I consulted the documentation at:
http://hadoop.apache.org/core/docs/current/cluster_setup.html#Installation
which only has 2 bits of info on this --
1) "The root of the distribution is referred to as HADOOP_HOME. All machines
in the cluster usually have the same HADOOP_HOME path."
and
2) "Once all the necessary configuration is complete, distribute the files
to the HADOOP_CONF_DIR directory on all the machines, typically
${HADOOP_HOME}/conf."

I had forgotten to do anything about the second instruction. After doing
that, I got:
$ bin/stop-all.sh
no jobtracker to stop
node2: bash: /new/dir/hadoop/bin/hadoop-daemon.sh: No such file or directory

Ok, it found the config dir, but now it expects the binary to be located at
the same HADOOP_HOME that the master uses?

I suppose I could, for each datanode, symlink things to point to the actual
Hadoop installation. But really, I would like the setup that is hinted as
possible by statement 1). Is there a way I could do it, or should that bit
of documentation read, "All machines in the cluster _must_ have the same
HADOOP_HOME?"

Thanks!


Re: best command line way to check up/down status of HDFS?

2008-07-02 Thread Meng Mao
I really like method 3.

I am currently screen-scraping the jobtracker JSP page, but I thought that was
only a partial solution, since the format of the page could change at any
moment, and because it's potentially much more computationally intensive,
depending on how much information I want to extract. One thing I thought of
would be to create a custom 'naked' JSP with very little formatting.
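
For (3), something along these lines is what I'd probably wrap in a Nagios
plugin -- a minimal sketch, assuming the daemon's embedded web server answers
on whatever info port is configured (the host and port arguments are
placeholders for our own config):

import java.net.HttpURLConnection;
import java.net.URL;

public class HttpLivenessCheck {
  /** Returns true if the daemon's status page answers with HTTP 200. */
  static boolean isUp(String host, int port, int timeoutMillis) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) new URL("http://" + host + ":" + port + "/").openConnection();
      conn.setConnectTimeout(timeoutMillis);
      conn.setReadTimeout(timeoutMillis);
      conn.setUseCaches(false); // avoid the stale-happy-page trap mentioned below
      return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
    } catch (Exception e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // e.g. java HttpLivenessCheck datanode7 50075 -- the exit code doubles as the Nagios status
    boolean up = isUp(args[0], Integer.parseInt(args[1]), 5000);
    System.out.println(up ? "OK" : "CRITICAL");
    System.exit(up ? 0 : 2);
  }
}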

On Wed, Jul 2, 2008 at 6:19 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> Meng Mao wrote:
>
>> For a Nagios script I'm writing, I'd like a command-line method that
>> checks
>> if HDFS is up and running.
>> Is there a better way than to attempt a hadoop dfs command and check the
>> error code?
>>
>
> 1. There is JMX support built in to Hadoop. If you can bring up Hadoop
> running a JMX agent that is compatible with Nagios, you can keep a close eye
> on the internals.
>
> 2. I'm making some lifecycle changes to Hadoop; if/when accepted, every
> service (name, data, job, ...) will have an internal ping() operation to check
> their health -- this can be checked in-process only. I'm also adding the
> smartfrog support to do that in-process pinging, fallback etc; I don't
> know how Nagios would work there, but JMX support for these ops should also
> be possible.
>
> 3. When a datanode comes up it starts jetty on a specific port -you can do
> a GET against that jetty instance to see if it is responding. This is a good
> test as it really does verify that the service is live and responding.
> Indeed, that is the official definition of "liveness", at least according to
> Lamport.
>  * review the code to make sure it turns caching off, or you can be burned
> probing for health long haul, seeing the happy page and thinking all is
> well. I forgot to do that in happyaxis.jsp, which is why axis 1.x health
> checks don't work long-haul.
>  * I could imagine improving those pages with better ones, like something
> that checks that the available freespace is within a certain range, and
> returns an error code if there is less, e.g.
>  http://datanode7:5000/checkDiskSpace?mingb=1500
> would test for a min disk space of 1500GB.
>
> There are also web pages for job trackers & the like; better for remote
> health checking than jps checks. JPS (and killall) is better for fallback
> when the things stop responding, but  not adequate for liveness checks.
>
>


-- 
hustlin, hustlin, everyday I'm hustlin


Re: best command line way to check up/down status of HDFS?

2008-06-27 Thread Meng Mao
I was thinking of checking for both independently, and taking a logical OR.
Would that be sufficient?

I'm trying to avoid log reading if possible. Not that reading through a log
is that intensive, but it'd be cleaner if I could either poll Hadoop itself
or inspect the running processes.
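
By "poll Hadoop itself" I mean something along these lines -- a sketch using
the FileSystem client API; it assumes hadoop-site.xml is on the classpath so
the configuration points at the right namenode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpCheck {
  public static void main(String[] args) {
    try {
      // Picks up hadoop-default.xml / hadoop-site.xml from the classpath.
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      fs.exists(new Path("/")); // throws if the namenode isn't answering
      System.out.println("OK");
      System.exit(0);
    } catch (Exception e) {
      System.out.println("CRITICAL: " + e.getMessage());
      System.exit(2);
    }
  }
}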

On Fri, Jun 27, 2008 at 1:23 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:

> that won't work since the namenode may be down, but the secondary namenode
> may be up instead
>
> why not instead just look at the respective logs?
>
> Miles
>
> 2008/6/27 Meng Mao <[EMAIL PROTECTED]>:
>
> > Is running:
> > ps aux | grep [\\.]NameNode
> >
> > and looking for a non empty response a good way to test HDFS up status?
> >
> > I'm assuming that if the NameNode process is down, then DFS is definitely
> > down?
> > Worried that there'd be frequent cases of DFS being messed up but the
> > process still running just fine.
> >
> > On Fri, Jun 27, 2008 at 10:48 AM, Meng Mao <[EMAIL PROTECTED]> wrote:
> >
> > > For a Nagios script I'm writing, I'd like a command-line method that
> > checks
> > > if HDFS is up and running.
> > > Is there a better way than to attempt a hadoop dfs command and check
> the
> > > error code?
> > >
> >
> >
> >
> > --
> > hustlin, hustlin, everyday I'm hustlin
> >
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>



-- 
hustlin, hustlin, everyday I'm hustlin


Re: best command line way to check up/down status of HDFS?

2008-06-27 Thread Meng Mao
Is running:
ps aux | grep [\\.]NameNode

and looking for a non-empty response a good way to test whether HDFS is up?

I'm assuming that if the NameNode process is down, then DFS is definitely
down? I'm worried there would be frequent cases of DFS being messed up while
the process is still running just fine.

On Fri, Jun 27, 2008 at 10:48 AM, Meng Mao <[EMAIL PROTECTED]> wrote:

> For a Nagios script I'm writing, I'd like a command-line method that checks
> if HDFS is up and running.
> Is there a better way than to attempt a hadoop dfs command and check the
> error code?
>



-- 
hustlin, hustlin, everyday I'm hustlin


best command line way to check up/down status of HDFS?

2008-06-27 Thread Meng Mao
For a Nagios script I'm writing, I'd like a command-line method that checks
if HDFS is up and running.
Is there a better way than to attempt a hadoop dfs command and check the
error code?


Re: getting hadoop job status/progress outside of hadoop

2008-06-17 Thread Meng Mao
What if I'm not interested in which job is running, but simply in whether the
current job has stalled or failed? Is there a way to avoid specifying a job
by its job ID? I apologize if there's some command-line documentation I'm
missing, but the commands change a bit from point version to point version.
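
For reference, roughly what I'm hoping is possible -- a sketch against the old
JobClient API, if I'm reading it right; the mapred.job.tracker value below is
just a placeholder for our own jobtracker address:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class RunningJobsProbe {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    conf.set("mapred.job.tracker", "jobtracker-host:9001"); // placeholder
    JobClient client = new JobClient(conf);
    // Every job that has been submitted but not yet completed, no job id needed.
    for (JobStatus status : client.jobsToComplete()) {
      System.out.println(status.getJobId()
          + " map=" + status.mapProgress()
          + " reduce=" + status.reduceProgress());
    }
  }
}

That would at least let a wrapper script flag "nothing running at all" or "no
progress since the last check" without knowing job ids up front.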

On Tue, Jun 17, 2008 at 1:41 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:

> To get this from some other application rather than Hadoop, you  just need
> to run this within a shell (I do this kind of thing within perl)
>
> Miles
>
> 2008/6/17 Miles Osborne <[EMAIL PROTECTED]>:
>
> > try this:
> >
> > hadoop job  -Dmapred.job.tracker=hermitage:9001 -status
> > job_200806160820_0430
> >
> > (and replace my job id with the one you want to track):
> > >
> > hadoop job  -Dmapred.job.tracker=hermitage:9001 -status
> > job_200806160820_0430
> >
> > Job: job_200806160820_0430
> > file: /data/tmp/hadoop/mapred/system/job_200806160820_0430/job.xml
> > tracking URL:
> > http://hermitage:50030/jobdetails.jsp?jobid=job_200806160820_0430
> > map() completion: 1.0
> > reduce() completion: 0.20370372
> > >
> >
> > Miles
> >
> > 2008/6/17 Kayla Jay <[EMAIL PROTECTED]>:
> >
> >
> >>
> >> Hi
> >>
> >> Is there a way to grab a hadoop job's status/progress outside of the job
> >> and outside of hadoop?
> >> I.e if I have another application running and  this application needs to
> >> know that a job has ended or the status percentage while the job is
> running,
> >> how can an external app like  this get status from the hadoop job or
> cluster
> >> that the job is done and the progress while it's running?
> >>
> >> Is there a hook into the status via HTTP or any other interface?  How
> can
> >> external apps get progress of the job running and notification when it's
> >> done running?  I was thinking there might be a hook in since it reports
> it
> >> via the JobTracker.
> >>
> >> Thanks.
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> > --
> > The University of Edinburgh is a charitable body, registered in Scotland,
> > with registration number SC005336.
>
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>



-- 
hustlin, hustlin, everyday I'm hustlin


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Meng Mao
I'm interested in the same thing -- is there a recommended way to batch
Hadoop jobs together?
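
The pattern I had been assuming is a single driver that submits the jobs one
by one -- a sketch only, relying on JobClient.runJob() blocking until the
submitted job finishes and throwing an exception if it fails:

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SequentialJobsDriver {
  /** Runs each already-configured job in order; stops at the first failure. */
  static void runInOrder(JobConf... jobs) throws IOException {
    for (JobConf job : jobs) {
      JobClient.runJob(job); // blocks until this job completes, throws on failure
    }
  }
}

That sidesteps the shell-level wait entirely, though it wouldn't help with the
"Job tracker still initializing" error below, which looks like the JobTracker
simply not being ready yet right after start-all.sh.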

On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang <[EMAIL PROTECTED]>
wrote:

> Hello folks:
> I am running several hadoop applications on hdfs. To save the efforts in
> issuing the set of commands every time, I am trying to use bash script to
> run the several applications sequentially. To let each job finish before
> proceeding to the next one, I am using wait in the script like below.
>
> sh bin/start-all.sh
> wait
> echo cluster start
> (bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D
> test.randomwrite.bytes_per_map=107374182 rand)
> wait
> bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter  -D
> test.randomtextwrite.total_bytes=107374182 rand-text
> bin/stop-all.sh
> echo finished hdfs randomwriter experiment
>
>
> However, it always give the error like below. Does anyone have better idea
> on how to run the multiple sequential jobs with bash script?
>
> HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell
>
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker
> still
> initializing
>at
> org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
>at
> org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>
>at org.apache.hadoop.ipc.Client.call(Client.java:557)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>at $Proxy1.getNewJobId(Unknown Source)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>at $Proxy1.getNewJobId(Unknown Source)
>at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
>at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
>at
> org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at
> org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at
>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>at
> org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
>



-- 
hustlin, hustlin, everyday I'm hustlin


checking per-node health (jobs, tasks, failures)?

2008-06-04 Thread Meng Mao
I'm trying to implement Nagios health monitoring of a Hadoop grid.
If anyone has general tips to share, those would be welcome, too.
For those who don't know, Nagios is monitoring software that organizes and
manages checking of services.

As best as I know, the easiest, most decoupled way to monitor the grid is to
use a script to parse the jobtracker and tasktracker JSPs that are served
when the Hadoop instance is running.

My original implementation was 1 script that pointed to the 2 jsps on the
primary namenode. However, this led to serious performance hangups from
Nagios' bombarding the primary node with frequent checks. To fix this, I'd
like to distribute the script to each Hadoop datanode, so that Nagios is
polling each node directly, instead of always going through the primary node
and making it do all of the work for the whole grid.

The problem is with job info. I can't think of a way to ask a datanode for
this, since it doesn't serve the jobtracker.jsp. Only the namenode serves
that jsp.

Is there 1) a better way to get this info? (I'm scripting in perl, so writing
a custom jar to find things out would be rather convoluted.) Or 2) a
straightforward way to get job status from a namenode directly?
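
If I do end up writing a small jar despite the perl preference, something
like this is what I'd try first -- a sketch that asks the JobTracker for
cluster-wide numbers over RPC via JobClient (so it could run from any node
that has the client config on its classpath), rather than scraping
jobtracker.jsp:

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterHealthProbe {
  public static void main(String[] args) throws Exception {
    // Picks up mapred.job.tracker from the hadoop config on the classpath.
    JobConf conf = new JobConf();
    JobClient client = new JobClient(conf);
    ClusterStatus status = client.getClusterStatus();
    System.out.println("task trackers: " + status.getTaskTrackers());
    System.out.println("running maps: " + status.getMapTasks()
        + " / capacity " + status.getMaxMapTasks());
    System.out.println("running reduces: " + status.getReduceTasks()
        + " / capacity " + status.getMaxReduceTasks());
  }
}

It still talks to the JobTracker, just not through the JSP, so it wouldn't
spread the load the way true per-datanode checks would.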

Thanks!