Hi Harsh,

Thanks for the quick response.
Have a few clarifications regarding the 1st point. Let me give the background first.

We have set up a Hadoop cluster with HBase installed. We are planning to load HBase with data, perform some computations on that data, and present the results in a report format. The report should be accessible from outside the cluster, and it accepts certain parameters; those parameters are passed on to the Hadoop master server, where a MapReduce job is run that queries HBase to retrieve the data. Since the report is run from a different machine outside the cluster, we need a way to pass the parameters to the Hadoop cluster (master) and initiate a MapReduce job dynamically. Similarly, the output of the MapReduce job needs to be tunnelled back to the machine from which the report was run.

Some more clarification I need: does the machine outside the cluster that runs the report require something like a client installation that talks to the Hadoop master server over TCP? Or can it run a job on the Hadoop server by using passwordless scp to the master machine, or something of the like?

Regards,
Narayanan

On Fri, Jul 1, 2011 at 11:41 AM, Harsh J <ha...@cloudera.com> wrote:
> Narayanan,
>
> On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K <knarayana...@gmail.com> wrote:
> > Hi all,
> >
> > We are basically working on a research project and I require some help
> > regarding this.
>
> Always glad to see research work being done! What're you working on? :)
>
> > How do I submit a mapreduce job from outside the cluster, i.e. from a
> > different machine outside the Hadoop cluster?
>
> If you use Java APIs, use the Job#submit(…) method and/or the
> JobClient.runJob(…) method.
> Basically Hadoop will try to create a jar with all requisite classes
> within and will push it out to the JobTracker's filesystem (HDFS, if
> you run HDFS). From there on, it's like a regular operation.
>
> This even happens on the Hadoop nodes itself, so doing so from an
> external place, as long as that place has access to Hadoop's JT and
> HDFS, should be no different at all.
>
> If you are packing custom libraries along, don't forget to use
> DistributedCache. If you are packing custom MR Java code, don't forget
> to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
> API methods.
>
> > If the above can be done, how can I schedule map reduce jobs to run in
> > hadoop like crontab from a different machine?
> > Are there any webservice APIs that I can leverage to access a hadoop
> > cluster from outside and submit jobs or read/write data from HDFS?
>
> For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
> It is well supported and is very useful in writing MR workflows (which
> is a common requirement). You also get coordinator features and can
> schedule similar to crontab functionalities.
>
> For HDFS r/w over the web, I am not sure of an existing web app built
> specifically for this purpose without limitations, but there is a
> contrib/thriftfs you can leverage (if not writing your own webserver in
> Java, in which case it's as simple as using the HDFS APIs).
>
> Also have a look at the pretty mature Hue project, which aims to
> provide a great frontend that lets you design jobs, submit jobs,
> monitor jobs, and upload files or browse the filesystem (among several
> other things): http://cloudera.github.com/hue/
>
> --
> Harsh J
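For concreteness, here is a minimal, untested sketch in Java of the "thin client" submission Harsh describes above: a small driver that a machine outside the cluster could run to push a parameterized, HBase-scanning MapReduce job to the master over Hadoop's RPC (TCP), with no ssh/scp involved. It assumes the Hadoop and HBase client jars plus the cluster's *-site.xml files are on the submitting machine's classpath; the class, table, parameter, host, and path names (ReportJobClient, ReportMapper, report_table, report.start.date, hadoop-master, /reports/output) are placeholders for illustration, not anything from this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ReportJobClient {

    // Hypothetical mapper: reads a report parameter back from the job configuration
    // and emits the row key of every HBase row the scan hands it.
    public static class ReportMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            String startDate = context.getConfiguration().get("report.start.date");
            context.write(new Text(row.get()), new Text(startDate));
        }
    }

    public static void main(String[] args) throws Exception {
        // The submitting machine only needs the Hadoop/HBase client jars and the
        // cluster's *-site.xml files on its classpath; the hostnames below are
        // placeholders for what those config files would normally provide.
        Configuration conf = HBaseConfiguration.create();
        conf.set("fs.default.name", "hdfs://hadoop-master:9000");
        conf.set("mapred.job.tracker", "hadoop-master:9001");
        conf.set("hbase.zookeeper.quorum", "hadoop-master");

        // Report parameters travel to the map tasks through the job configuration.
        conf.set("report.start.date", args[0]);

        Job job = new Job(conf, "hbase-report");
        job.setJarByClass(ReportJobClient.class); // ships this jar to the cluster

        Scan scan = new Scan(); // could be narrowed using the report parameters
        TableMapReduceUtil.initTableMapperJob("report_table", scan,
                ReportMapper.class, Text.class, Text.class, job);

        job.setNumReduceTasks(0); // map-only sketch; add a reducer for aggregation
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/reports/output"));

        // Submission goes over RPC to the JobTracker; once it returns, the report
        // front-end can read /reports/output back via the HDFS client API.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If the report front-end is not written in Java, a driver along these lines could sit behind a small web service on the report machine, or the job could be scheduled and triggered through Oozie as suggested above; the report machine would then read the job's output directory back using the HDFS client APIs.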