Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel


Depending on the nature of your jobs, Cascading has a built-in topological scheduler. It will schedule all your work as its dependencies are satisfied, the dependencies being source data and inter-job intermediate data.


http://www.cascading.org
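
For a flavor of what that looks like in code, here is a minimal, hypothetical sketch against the early Cascading API: two flows share a tap, and the CascadeConnector infers that one must run before the other. The class name, paths, and pipe names are illustrative only, and a real flow would attach operations to its pipes.

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;

public class TwoStepCascade {  // illustrative name, not from this thread
  public static void main(String[] args) {
    FlowConnector connector = new FlowConnector();

    // first flow: raw input -> intermediate data (paths are placeholders)
    Tap rawSource = new Hfs(new TextLine(), "input/raw");
    Tap intermediate = new Hfs(new TextLine(), "tmp/intermediate");
    Flow first = connector.connect(rawSource, intermediate, new Pipe("step1"));

    // second flow: consumes the first flow's sink as its source
    Tap finalSink = new Hfs(new TextLine(), "output/final");
    Flow second = connector.connect(intermediate, finalSink, new Pipe("step2"));

    // the CascadeConnector infers the dependency from the shared tap,
    // so "second" is not started until "first" has written its sink
    Cascade cascade = new CascadeConnector().connect(first, second);
    cascade.complete(); // blocks until every flow in the cascade finishes
  }
}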

The first catch is that you will still need bash to start/stop your cluster and to start the Cascading job (per your example below).


The second catch is that you currently must use the Cascading API (or the Groovy API) to assemble your data processing flows. Hopefully in the next couple of weeks we will have a means to support custom/raw Hadoop jobs as members of a set of dependent jobs.


This feature is being delayed by our adding support for stream assertions: the ability to validate data at runtime, but have the assertions 'planned' out of the process flow on demand, i.e. for production runs.

And for stream traps: built-in support for siphoning off bad data into side files, so long-running (or low-fidelity) jobs can continue running without losing any data.


You can read more about these features here:
http://groups.google.com/group/cascading-user

ckw

On Jun 10, 2008, at 2:48 PM, Meng Mao wrote:

I'm interested in the same thing -- is there a recommended way to batch Hadoop jobs together?

On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang [EMAIL PROTECTED] wrote:

 [snip: original script and stack trace, quoted in full in the root message]

Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Ted Dunning
Just a quick plug for Cascading. Our team uses Cascading quite a bit and has found it to be a simpler way to write map-reduce jobs. The guys using it find it very helpful.

On Wed, Jun 11, 2008 at 1:31 PM, Chris K Wensel [EMAIL PROTECTED] wrote:


 Depending on the nature of your jobs, Cascading has built in a topological
 scheduler. It will schedule all your work as their dependencies are
 satisfied. Dependencies being source data and inter-job intermediate data.

 http://www.cascading.org





-- 
ted


RE: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Haijun Cao
Ted,

I find Cascading very similar to Pig; would you care to share your thoughts here? If map-reduce programmers are to move up to the next level (a scripting/query language), which way should they go?

Thanks
Haijun 
 

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 11, 2008 2:16 PM
To: core-user@hadoop.apache.org
Subject: Re: does anyone have idea on how to run multiple sequential jobs with bash script

 [snip: quoted reply]


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Arun C Murthy


On Jun 10, 2008, at 2:48 PM, Meng Mao wrote:

I'm interested in the same thing -- is there a recommended way to batch Hadoop jobs together?



Hadoop Map-Reduce JobControl:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control

and
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl
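
For illustration, a minimal sketch of the JobControl approach using the classic org.apache.hadoop.mapred.jobcontrol API. The JobConf objects are left unconfigured here (mappers, reducers, and paths would need to be set), and the class name is illustrative; treat this as an outline rather than tested code.

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class SequentialJobs {  // illustrative name
  public static void main(String[] args) throws IOException, InterruptedException {
    JobConf firstConf = new JobConf();   // set mapper/reducer/paths here
    JobConf secondConf = new JobConf();  // reads the first job's output

    Job first = new Job(firstConf);
    Job second = new Job(secondConf);
    second.addDependingJob(first);       // run second only after first succeeds

    JobControl control = new JobControl("sequential-example");
    control.addJob(first);
    control.addJob(second);

    // JobControl.run() blocks in its scheduling loop, so drive it from
    // another thread and poll until all jobs are done
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}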


Arun

On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang [EMAIL PROTECTED] wrote:

 [snip: original script and stack trace, quoted in full in the root message]




Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Ted Dunning
Pig is much more ambitious than Cascading. Because of those ambitions, simple things got overlooked. For instance, something as simple as computing a file name to load is not possible in Pig, nor is it possible to write functions in Pig. You can hook to Java functions (for some things), but you can't really write programs in Pig. On the other hand, Pig may eventually provide really incredible capabilities, including program rewriting and optimization, that would be incredibly hard to write directly in Java.

The point of Cascading was simply to make life easier for a normal Java/map-reduce programmer. It provides an abstraction for gluing together several map-reduce programs and for doing a few common things like joins. Because you are still writing Java (or Groovy) code, you have all of the functionality you always had. But this same benefit costs you the future in terms of what optimizations are likely to ever be possible.

The summary for us (especially 4-6 months ago when we were deciding) is that Cascading is good enough to use now, and Pig will probably be more useful later.

On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao [EMAIL PROTECTED] wrote:

 I find Cascading very similar to Pig; would you care to share your thoughts here? If map-reduce programmers are to move up to the next level (a scripting/query language), which way should they go?





RE: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Haijun Cao
Thanks for sharing. We need to expose our Hadoop cluster to 'casual' users for ad-hoc queries, and I find it difficult to ask them to write map-reduce programs; Pig Latin comes in very handy in this case. However, for continuous production data processing, Hadoop+Cascading sounds like a good option.

Haijun

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 11, 2008 5:01 PM
To: core-user@hadoop.apache.org
Subject: Re: does anyone have idea on how to run multiple sequential jobs with bash script

 [snip: quoted reply]

Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel

Thanks, Ted.

A couple of quick comments.

At one level, Cascading is a MapReduce query planner, just like PIG, except that the API is intended for public consumption and is fully extensible; in PIG you typically interact through the PigLatin syntax. Consequently, with Cascading you can layer your own syntax on top of the API. Currently there is Groovy support (Groovy is used to assemble the work; it does not run on the mappers or reducers). I hear rumors about Jython elsewhere.


A couple of Groovy examples (note these are obviously trivial; the DSL can absorb tremendous complexity if need be):

http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/wordcount.groovy
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/widefinder.groovy

Since Cascading is in part a 'planner', it internally builds a new representation from what the developer assembled and renders out the necessary map/reduce jobs (transparently linking them) at runtime. As Hadoop evolves, the planner will incorporate the new features and leverage them transparently. Plus, there are opportunities for identifying patterns and applying different strategies (hypothetically, map-side vs reduce-side joins, for one). It is also conceivable (but untried) that different planners could exist to target systems other than Hadoop (making your code/libraries portable). Much of this is true for PIG as well.

http://www.cascading.org/documentation/overview.html

Also, Cascading will at some point provide a PIG adapter, allowing PigLatin queries to participate in a larger Cascading 'Cascade' (the topological scheduler). Cascading is great at integration, connecting things outside Hadoop with work to be done inside Hadoop, and PIG looks like a great way to concisely represent a complex solution and execute it. There isn't any reason they can't work together (that has always been the intention).


The takeaway is that with Cascading and PIG, users do not think in MapReduce. With PIG, you think in PigLatin. With Cascading, you can use the pipe/filter-based API, or use your favorite scripting language and build a DSL for your problem domain.

Many companies have done similar things internally, but those tend to be nothing more than a scriptable way to write a map/reduce job and glue such jobs together. You still think in MapReduce, which in my opinion doesn't scale well.


My (biased) recommendation is this:

Build out your application in Cascading. If part of the problem is best represented in PIG, no worries: use PIG, and feed and clean up after PIG with Cascading. And if you see a solvable bottleneck and we can't convince the planner to recognize the pattern and plan better, replace that piece of the process with a custom MapReduce job (or more).

Solve your problem first; then optimize the solution, if need be.

ckw

On Jun 11, 2008, at 5:00 PM, Ted Dunning wrote:

 [snip: quoted reply]





--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel
 However, for continuous production data processing, Hadoop+Cascading sounds like a good option.

This will be especially true with stream assertions and traps (as mentioned previously, and available in trunk). *grin*

I've written workloads for clients that render down to ~60 unique Hadoop map/reduce jobs, all inter-related, from ~10 unique units of work (internally lots of joins, sorts, and math). I can't imagine having written them by hand.


ckw

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Richard Zhang
Hello folks:
I am running several Hadoop applications on HDFS. To save the effort of issuing the set of commands every time, I am trying to use a bash script to run the several applications sequentially. To let each job finish before proceeding to the next job, I am using wait in the script as below.

sh bin/start-all.sh
wait
echo cluster start
(bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D
test.randomwrite.bytes_per_map=107374182 rand)
wait
bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter  -D
test.randomtextwrite.total_bytes=107374182 rand-text
bin/stop-all.sh
echo finished hdfs randomwriter experiment


However, it always gives an error like the one below. Does anyone have a better idea of how to run multiple sequential jobs with a bash script?

HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker still initializing
    at org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
    at org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

    at org.apache.hadoop.ipc.Client.call(Client.java:557)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at $Proxy1.getNewJobId(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.getNewJobId(Unknown Source)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
    at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Meng Mao
I'm interested in the same thing -- is there a recommended way to batch
Hadoop jobs together?

On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang [EMAIL PROTECTED] wrote:

 [snip: original script and stack trace, quoted in full in the root message]




-- 
hustlin, hustlin, everyday I'm hustlin


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Edward Capriolo
wait and sleep are not what you are looking for. You can use 'nohup' to run a job in the background and have its output piped to a file.

On Tue, Jun 10, 2008 at 5:48 PM, Meng Mao [EMAIL PROTECTED] wrote:

 [snip: quoted question, original script, and stack trace]



Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Miles Osborne
You have another problem, in that Hadoop is still initialising -- this will cause subsequent jobs to fail.

I've not yet migrated to 0.17.0 (I still use 0.16.3), but all my jobs are run from nohup'ed scripts. If you really want to check on the running status and busy-wait, you can look at the jobtracker log and poll it for when everything is finished.
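
As an alternative to grepping the log, here is a hedged sketch of the same busy-wait done programmatically against the classic (pre-0.20) JobClient API; the class name is illustrative and this is untested.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class WaitForJobs {  // illustrative name
  public static void main(String[] args) throws Exception {
    // connects to the jobtracker named in the default configuration
    JobClient client = new JobClient(new JobConf());

    // jobsToComplete() returns the jobs still preparing or running
    JobStatus[] pending = client.jobsToComplete();
    while (pending != null && pending.length > 0) {
      Thread.sleep(5000); // poll every five seconds
      pending = client.jobsToComplete();
    }
    System.out.println("all submitted jobs have completed");
  }
}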

My turn to ask a question in the next post ..

Miles
2008/6/10 Richard Zhang [EMAIL PROTECTED]:

 [snip: original script and stack trace, quoted in full in the root message]




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.