Harsh,

Thanks.

(1) It seems I need more shell knowledge here. For example, in order to
submit several jobs at the same time, I have a shell script batch.sh with
just these three lines:
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' &
bin/hadoop jar hadoop-examples-*.jar grep input output2 'dfs[a-z.]+' &
bin/hadoop jar hadoop-examples-*.jar grep input output3 'dfs[a-z.]+' &

Is this what you mean by fork? Could you please give a short example to
show how to use "$!" for monitoring? I could then google and learn it
further myself.
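
To make my question concrete, here is my rough guess at what you mean (I'm
not sure this is right):
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' &
PID1=$!
bin/hadoop jar hadoop-examples-*.jar grep input output2 'dfs[a-z.]+' &
PID2=$!
# e.g. check they are still alive with kill -0, or block until both finish:
wait $PID1 $PID2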

(2) Sorry for the confusion here. Do you think it's feasible to control the
interval between job submissions in a shell script (or Python etc.), if I
don't change the Java code? For instance, in the above example, can I make
the 2nd job start 1 second after the submission of the 1st job, and then
submit the last job 5 seconds after the second one?
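
Something like this rough sketch is what I have in mind (assuming putting
sleep between the commands is the right way to do it):
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' &
sleep 1
bin/hadoop jar hadoop-examples-*.jar grep input output2 'dfs[a-z.]+' &
sleep 5
bin/hadoop jar hadoop-examples-*.jar grep input output3 'dfs[a-z.]+' &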

(3) By the way, I also tried to modify the Java code to run several grep
jobs from hadoop-examples-*.jar, so I modified Grep.java in
src/examples/org/apache/hadoop/examples as follows and named the new class
EthanGrep.java:
I removed the main function and replaced JobClient.runJob(grepJob); with
JobClient.submitJob(grepJob); in run(String[] args).

Then I wrote an EthanTest.java:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class EthanTest {

  public static void main(String[] args) throws Exception {
    // submit the grep job twice via the modified EthanGrep tool
    for (int x = 0; x < 2; x++) {
      int res = ToolRunner.run(new Configuration(), new EthanGrep(), args);
    }
  }
}

No compilation error. But when I later try:
bin/hadoop jar hadoop-examples-*.jar EthanTest

it complains that no such program "EthanTest" is found and that the valid
names are grep etc. It seems I am missing something here?

Thanks again,
Ethan

On Sun, Apr 22, 2012 at 4:59 PM, Harsh J <ha...@cloudera.com> wrote:

> Hey Ethan,
>
> First question: Yes, that is what I meant.
>
> Second question: When you do a fork, the PID of the last command from
> the script is stored in the "$!" variable. You can grab these each time
> you do a fork and then monitor them (at least PID-wise).
>
> I'm still not sure what you mean by "especially if I need to run
> different kinds of jobs and control the inter-arrival time?" actually
> but forking is the answer to your other need, if you can't change
> code.
>
> On Sun, Apr 22, 2012 at 11:42 PM, brisk <mylinq...@gmail.com> wrote:
> > Hi, Harsh,
> >
> > Thanks so much for your answer!
> >
> > By "run multiple command lines using a fork and managing them
> afterwards",
> > do you mean just put "&" at the end of each command and let each command
> > line run in the background? Then what do you mean by "managing them
> > afterwards"?
> >
> > Best,
> > Ethan
> >
> > On Sun, Apr 22, 2012 at 12:02 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Is your requirement to not have the job launcher program return until
> >> completion? For that you should either edit the java sources to not
> >> waitForCompletion(…) (and just submit()), or run multiple command
> >> lines using a fork and managing them afterwards.
> >>
> >> For example you can do:
> >> bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' &
> >>
> >> And the process should run in the background until termination,
> >> allowing you to run another without needing to open a new terminal.
> >>
> >> Is this what you're looking for?
> >>
> >> On Sun, Apr 22, 2012 at 9:51 PM, brisk <mylinq...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Does anybody know how to submit multiple hadoop jobs without opening
> >> > multiple terminals? I found one method is to use Job.Submit() in
> >> > ToolRunner.run(),
> >> > but can I use a shell script to submit jobs (with command like
> >> > "bin/hadoop
> >> > jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' ") instead of
> >> >  modifying java files/source code,
> >> > especially if I need to run different kinds of jobs and control the
> >> > inter-arrival time?
> >> >
> >> > Thanks,
> >> > Ethan
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
