Narayanan,

Regarding the client installation: make sure the client and the server use the same Hadoop version for submitting jobs and transferring data. If the client runs as a different user than the one that runs Hadoop jobs on the cluster, configure the Hadoop ugi property (sorry, I forget the exact name).
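To make that concrete, here is a minimal sketch (against the 0.20-era Java API) of submitting a job from a machine outside the cluster. The host names, ports, paths, and the commented-out ugi property are placeholders/assumptions, not values from this thread:

// Minimal client-side submission sketch; adapt addresses and paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode and JobTracker
        // (the client must run the same Hadoop version as the cluster).
        conf.set("fs.default.name", "hdfs://master.example.com:9000");
        conf.set("mapred.job.tracker", "master.example.com:9001");
        // If the client user differs from the cluster user, older
        // (pre-security) releases honour a user/group override property,
        // reportedly hadoop.job.ugi -- verify the name for your version.
        // conf.set("hadoop.job.ugi", "hadoop,hadoop");

        Job job = new Job(conf, "report-query");
        job.setJarByClass(RemoteSubmit.class); // job jar is shipped to the cluster
        // Identity map/reduce for brevity; a real report job would set its
        // own Mapper/Reducer classes here.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/reports/input"));
        FileOutputFormat.setOutputPath(job, new Path("/reports/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}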
On Jul 1, 2011 3:28 PM, "Narayanan K" <knarayana...@gmail.com> wrote:
> Hi Harsh
>
> Thanks for the quick response...
>
> Have a few clarifications regarding the 1st point:
>
> Let me tell the background first.
>
> We have actually set up a Hadoop cluster with HBase installed. We are
> planning to load HBase with data, perform some computations with the data,
> and show the data in a report format. The report should be accessible from
> outside the cluster, and it accepts certain parameters to show data; these
> parameters will in turn be passed on to the Hadoop master server, where a
> MapReduce job will be run that queries HBase to retrieve the data.
>
> So the report will be run from a different machine outside the cluster, and
> we need a way to pass the parameters to the Hadoop cluster (master) and
> initiate a MapReduce job dynamically. Similarly, the output of the
> MapReduce job needs to be tunneled back to the machine from where the
> report was run.
>
> Some more clarification I need: does the machine (outside the cluster)
> which ran the report require something like a client installation that
> talks to the Hadoop master server via TCP? Or can it run a job on the
> Hadoop server by using passwordless scp to the master machine, or
> something of the like?
>
> Regards,
> Narayanan
>
>
> On Fri, Jul 1, 2011 at 11:41 AM, Harsh J <ha...@cloudera.com> wrote:
>> Narayanan,
>>
>> On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K <knarayana...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > We are basically working on a research project and I require some help
>> > regarding this.
>>
>> Always glad to see research work being done! What're you working on? :)
>>
>> > How do I submit a mapreduce job from outside the cluster, i.e. from a
>> > different machine outside the Hadoop cluster?
>>
>> If you use the Java APIs, use the Job#submit(…) method and/or the
>> JobClient.runJob(…) method.
>> Basically Hadoop will try to create a jar with all requisite classes
>> within and will push it out to the JobTracker's filesystem (HDFS, if
>> you run HDFS). From there on, it's like a regular operation.
>>
>> This even happens on the Hadoop nodes themselves, so doing so from an
>> external place, as long as that place has access to Hadoop's JT and
>> HDFS, should be no different at all.
>>
>> If you are packing custom libraries along, don't forget to use
>> DistributedCache. If you are packing custom MR Java code, don't forget
>> to use Job#setJarByClass / JobClient#setJarByClass and other appropriate
>> API methods.
>>
>> > If the above can be done, how can I schedule mapreduce jobs to run in
>> > Hadoop like crontab from a different machine?
>> > Are there any webservice APIs that I can leverage to access a Hadoop
>> > cluster from outside and submit jobs or read/write data from HDFS?
>>
>> For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
>> It is well supported and is very useful in writing MR workflows (which
>> is a common requirement). You also get coordinator features and can
>> schedule similarly to crontab.
>>
>> For HDFS r/w over the web, I'm not sure of an existing web app
>> specifically for this purpose without limitations, but there is a
>> contrib/thriftfs you can leverage (if not writing your own webserver in
>> Java, in which case it's as simple as using the HDFS APIs).
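As a rough illustration of the DistributedCache / setJarByClass advice quoted above, a short sketch (0.20-era API; the HDFS jar path is only an example, and the library must already have been copied to HDFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithLibs {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "job-with-extra-libs");
        // Ship the jar that contains the job's own map/reduce classes.
        job.setJarByClass(SubmitWithLibs.class);
        // Put an extra library (already uploaded to HDFS; path is an example)
        // on every task's classpath via the DistributedCache.
        DistributedCache.addFileToClassPath(
                new Path("/libs/hbase.jar"), job.getConfiguration());
        // ...then set the Mapper/Reducer, input and output paths, and submit
        // exactly as in the earlier sketch.
    }
}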
>>
>> Also have a look at the pretty mature Hue project, which aims to
>> provide a great frontend that lets you design jobs, submit jobs,
>> monitor jobs, and upload files or browse the filesystem (among several
>> other things): http://cloudera.github.com/hue/
>>
>> --
>> Harsh J
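And for getting the job's output back to the report machine, reading HDFS directly with the FileSystem API works from outside the cluster too. A small sketch, with the NameNode address and output path as placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchReportOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://master.example.com:9000"); // example NameNode

        FileSystem fs = FileSystem.get(conf);
        // Read every part-* file the reducers produced (path is an example).
        FileStatus[] parts = fs.globStatus(new Path("/reports/output/part-*"));
        if (parts != null) {
            for (FileStatus part : parts) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(part.getPath())));
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // feed these lines into the report instead
                }
                reader.close();
            }
        }
        fs.close();
    }
}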