RE: Starting a Hadoop job programmatically

praveen.peddi Wed, 24 Nov 2010 09:52:12 -0800

Hi Henning,
Thanks again.

Let me explain my scenario first so that you can make better sense of my 
question. I have a web application running on a GlassFish server. Every 24 
hours a Quartz job runs on the server, and it needs to call a set of Hadoop 
jobs one after the other, read the final output and store it in a database. I 
have the logic for starting the jobs in a Driver class that I can call from the 
Quartz job.
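
A minimal sketch of the "one job after the other" flow described above, 
assuming each Hadoop job is wrapped in a step that reports success as a 
boolean. JobChain and the step names are hypothetical and not part of Hadoop or 
Quartz; in a real driver each step body would probably just call 
job.waitForCompletion(true).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper (not a Hadoop or Quartz class): runs named steps in
// insertion order and stops at the first step that fails.
class JobChain {
    private final Map<String, Callable<Boolean>> steps = new LinkedHashMap<>();

    // Registers a step. In a real driver the body would typically be
    // () -> job.waitForCompletion(true), which returns true on success.
    JobChain add(String name, Callable<Boolean> step) {
        steps.put(name, step);
        return this;
    }

    // Runs all steps in order and returns the names of the completed steps.
    // Throws IllegalStateException as soon as a step returns false or throws,
    // so later steps never run on top of a failed one.
    List<String> run() {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Callable<Boolean>> entry : steps.entrySet()) {
            boolean ok;
            try {
                ok = entry.getValue().call();
            } catch (Exception ex) {
                throw new IllegalStateException("Step failed: " + entry.getKey(), ex);
            }
            if (!ok) {
                throw new IllegalStateException("Step failed: " + entry.getKey());
            }
            completed.add(entry.getKey());
        }
        return completed;
    }
}
```

The Quartz job would build one chain per run and read the final output only 
after run() returns without throwing.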


Now my questions:
1. Where does multi-jar vs. single-jar come into the picture here (since my 
driver class resides on the GlassFish server)?
2. Given the above scenario, could you suggest a workable solution (if you are 
already doing something similar)?

BTW I am able to run the driver class from the Hadoop command line using the 
"one jar" approach.

Thanks
Praveen

________________________________
From: ext Henning Blohm [mailto:henning.bl...@zfabrik.de]
Sent: Wednesday, November 24, 2010 3:38 AM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Starting a Hadoop job programtically

Hi Praveen,

looking at the Job configuration you will find properties like user.name and 
more that have been created by substituting template values in 
core-default.xml and mapred-default.xml (both in the Hadoop jars). I suppose 
one of these (if not user.name) defines the user that submits. But I haven't 
tried, and I am sure others know better.

Why is that actually important? Why not submit as the user you are?

About submitting multiple jars: AFAIK the standard way is to submit everything 
in one jar.

Henning

ps.: We are developing something based on 
www.z2-environment.eu<http://www.z2-environment.eu> that will complement Hadoop 
with automatic on-demand update on the task node. But it's not public yet.


On Wed, 2010-11-24 at 00:10 +0100, praveen.pe...@nokia.com wrote:
Hi Henning,
Putting core-site.xml on the classpath worked. Thanks for the help. I need to 
figure out how to submit a job as a different user than the one Hadoop is 
configured for.
I have one more question related to job submission. Has anyone faced problems 
running a job that involves multiple jar files? I am running a map reduce job 
that references multiple jar files. When I run the job I always get a 
ClassNotFoundException for the class that is not in the same jar file as the 
job class.
I am starting the jobs from a Java application and am getting this 
ClassNotFoundException:
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.nokia.relevancy.util.hadoop.ValueOnlyTextOutputFormat
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
        at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:193)
        at org.apache.hadoop.mapred.Task.initialize(Task.java:413)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:288)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ClassNotFoundException: com.nokia.relevancy.util.hadoop.ValueOnlyTextOutputFormat
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:807)
        ... 4 more

Praveen
________________________________
From: ext Henning Blohm [mailto:henning.bl...@zfabrik.de]
Sent: Tuesday, November 23, 2010 11:37 AM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Starting a Hadoop job programtically



Hi Praveen,

On Tue, 2010-11-23 at 17:18 +0100, praveen.pe...@nokia.com wrote:
Hi Henning,
adding Hadoop's conf folder didn't help fix the issue, but when I added the 
two properties below, I was able to access the file system. However, I cannot 
write anything because of the different user. I have the following questions 
based on my experiments.

Exactly. I didn't mean to add the whole folder. Just the one file with those 
props.

1. How can I access HDFS or submit jobs as a different user than the one my 
Java app is running as? For example, the Hadoop cluster is set up for the 
"hadoop" user and my Java app is running as a different user. In order to run 
the job correctly, I have to submit it as the "hadoop" user, correct? How can I 
achieve that programmatically?

We always run everything with the same user (now that you mention it). Didn't 
know that we would have a problem otherwise. I would have suspected that the 
submitting user doesn't matter (setting the corresponding system property would 
probably override that one anyway).

2. A few of the jobs I am calling are provided by a library, which means I 
cannot add these two config properties myself. Is there any way around this 
other than copying the job submission code from the library into my own code?

Yes, I think creating a core-site.xml file as below, putting it into <folder> 
(any folder you like will do) and adding <folder> to your classpath when 
submitting should do the trick (as I tried to explain before and if I am not 
mistaken).

Thanks
Praveen

Good luck,
  Henning


________________________________


From: ext Henning Blohm [mailto:henning.bl...@zfabrik.de]
Sent: Tuesday, November 23, 2010 3:24 AM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Starting a Hadoop job programtically



Hi Praveen,

  in order to submit it to the cluster, you just need to have a core-site.xml 
on your classpath (or load it explicitly into your configuration object) that 
looks (at least) like this

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://${name:port of namenode}</value>
</property>

<property>
<name>mapred.job.tracker</name>
<value>${name:port of jobtracker}</value>
</property>
</configuration>
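
A small diagnostic sketch for the setup above, assuming the file is meant to be 
picked up from the classpath by name. As far as I know, Hadoop's Configuration 
looks resources up through the thread context class loader, so inside an app 
server it can help to check what that loader actually sees. SiteConfigCheck and 
its method names are hypothetical; only JDK classes are used.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (JDK only, no Hadoop dependency).
class SiteConfigCheck {

    // True if the named resource (e.g. "core-site.xml") is visible to the
    // current thread's context class loader.
    static boolean onClasspath(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = SiteConfigCheck.class.getClassLoader();
        }
        if (loader == null) {
            return ClassLoader.getSystemResource(resourceName) != null;
        }
        return loader.getResource(resourceName) != null;
    }

    // Reads <property><name>..</name><value>..</value></property> entries
    // from a Hadoop-style configuration document into an ordered map.
    static Map<String, String> readProperties(InputStream in) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception ex) {
            throw new IllegalArgumentException("Not a readable configuration file", ex);
        }
        Map<String, String> result = new LinkedHashMap<>();
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            String name = childText(property, "name");
            if (name != null) {
                result.put(name, childText(property, "value"));
            }
        }
        return result;
    }

    // Trimmed text of the first child element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

If onClasspath("core-site.xml") is false in the submitting application, the two 
properties above will not be seen and the job is likely to run against the 
local defaults instead of the cluster.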

If you want to wait for each job's completion, you can use 
job.waitForCompletion(true) rather than job.submit().

Good luck,
  henning


On Mon, 2010-11-22 at 23:40 +0100, praveen.pe...@nokia.com wrote:
Hi, thanks for your reply. In my case I have a Driver that calls multiple jobs 
one after the other. I am using the following code to submit each job, but it 
uses the local Hadoop jar files that are on the classpath. It's not submitting 
the job to the Hadoop cluster. I thought I would need to specify where the 
Hadoop master is located on the remote machine. An example command I use from 
the command line is as follows, but I need to do the same from my Java program.
$ hadoop-0.20.2/bin/hadoop jar \
    /home/ppeddi/dev/Merchandising/RelevancyEngine/relevancy-core/dist/Relevancy4.jar \
    -i raw-downloads-input-10K -o reco-patterns-output-10K-1S -k 100 \
    -method mapreduce -g 500 -regex '[\ ]' -s 5


I hope I made the question clear now.
Praveen

________________________________



From: ext Henning Blohm [mailto:henning.bl...@zfabrik.de]
Sent: Monday, November 22, 2010 5:07 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Starting a Hadoop job programtically



Hi Praveen,

  we do. We are using the "new" org.apache.hadoop.mapreduce.* API in Hadoop 
0.20.2.

  Essentially the flow is:

  //----
  // assuming all config is on the class path
  Configuration config = new Configuration();
  Job job = new Job(config, "some job name");

  // set in/out types
  job.setInputFormatClass(...);
  job.setOutputFormatClass(...);
  job.setMapOutputKeyClass(...);
  job.setMapOutputValueClass(...);
  job.setOutputKeyClass(...);
  job.setOutputValueClass(...);

  // set implementations as required
  job.setMapperClass(<your mapper implementation class object>);
  job.setCombinerClass(<your combiner implementation class object>);
  job.setReducerClass(<your reducer implementation class object>);

  // set the jar... this is often the tricky part!
  job.setJarByClass(<some class that is in the job jar and not elsewhere higher 
up on the class path>);

  job.submit();
  //----

Hope I didn't forget anything.

Note: You need to give Hadoop something it can launch in a JVM that has nothing 
more than the Hadoop jars and whatever else you configured statically in your 
hadoop-env.sh script.
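
A rough sketch of why the setJarByClass step can be tricky, under the 
assumption (as far as I understand the 0.20 code) that the job jar is found by 
looking up the class file as a classpath resource and only accepted when that 
resource lives inside a jar. JarLocator is a hypothetical JDK-only helper, not 
a Hadoop class, that reports the same information for any class.

```java
import java.net.URL;
import java.util.Optional;

// Hypothetical helper (JDK only): reports which jar, if any, a class was
// loaded from.
class JarLocator {

    // "com.example.Foo" -> "com/example/Foo.class"
    static String classResourceName(Class<?> type) {
        return type.getName().replace('.', '/') + ".class";
    }

    // Extracts the jar location from a URL string such as
    // "jar:file:/opt/app/job.jar!/com/example/Foo.class".
    // Returns empty for anything that is not a jar: URL, for example a class
    // loaded from a plain directory of .class files.
    static Optional<String> jarFromUrl(String url) {
        if (!url.startsWith("jar:")) {
            return Optional.empty();
        }
        int separator = url.indexOf('!');
        if (separator < 0) {
            return Optional.empty();
        }
        return Optional.of(url.substring("jar:".length(), separator));
    }

    // Looks the class file up through the class's own loader and returns the
    // jar it came from, or empty if it did not come from a jar.
    static Optional<String> containingJar(Class<?> type) {
        String resource = classResourceName(type);
        ClassLoader loader = type.getClassLoader();
        URL url = (loader != null)
                ? loader.getResource(resource)
                : ClassLoader.getSystemResource(resource);
        return url == null ? Optional.empty() : jarFromUrl(url.toString());
    }
}
```

Calling containingJar on the mapper class and on every class the job refers to 
by name (such as a custom output format) shows whether they all sit in the same 
jar. An empty result or a different jar for one of them would be consistent 
with a ClassNotFoundException on the task side.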

Can you describe your scenario in more detail?

Henning


Am Montag, den 22.11.2010, 22:39 +0100 schrieb praveen.pe...@nokia.com:
Hi all,
I am trying to figure out how I can start a Hadoop job programmatically from my 
Java application running in an app server. I was able to run my map reduce job 
using the hadoop command on the Hadoop master machine, but my goal is to run the 
same job from my Java program (running on a different machine than the master). 
I googled and could not find a solution for this. All the examples I have seen 
so far use hadoop from the command line to start a job.
1. Has anyone invoked a Hadoop job from a Java application?
2. If so, could someone provide some sample code?
Thanks
Praveen

Henning Blohm

ZFabrik Software KG

henning.bl...@zfabrik.de<mailto:henning.bl...@zfabrik.de>
www.z2-environment.eu<http://www.z2-environment.eu>
