Re: hadoop jobs take long time to setup

2009-06-29 Thread Marcus Herou
Of course... Thanks for the help!

Cheers

//Marcus


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: hadoop jobs take long time to setup

2009-06-28 Thread tim robertson
How long does it take to start the code locally in a single thread?

Can you reuse the JVM so it only starts once per node per job?
conf.setNumTasksToExecutePerJvm(-1)

Cheers,
Tim
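
[Editor's note: a minimal sketch of the JVM-reuse setting Tim mentions; the job class name is hypothetical, and setNumTasksToExecutePerJvm is only available from Hadoop 0.19 on, so this is an illustration rather than something guaranteed to compile against every branch discussed here.]

```java
JobConf conf = new JobConf(MyJob.class);
// -1 means no limit: a task JVM on a node is reused for all of this job's
// tasks, so JVM startup is paid once per node per job rather than per task.
conf.setNumTasksToExecutePerJvm(-1);
// Property form of the same setting:
// conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
```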



On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou marcus.he...@tailsweep.com wrote:
 Hi.

 Wonder how one should improve the startup times of a Hadoop job. Some of my
 jobs, which have a lot of dependencies in terms of many jar files, take a long
 time to start in Hadoop, up to 2 minutes sometimes.
 The data input amounts in these cases are negligible, so it seems that Hadoop
 has a really high setup cost, which I can live with, but this seems too much.

 Let's say a job takes 10 minutes to complete; then it is bad if it takes 2
 mins to set it up... 20-30 sec max would be a lot more reasonable.

 Hints ?

 //Marcus





Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Hi.

Running without a jobtracker makes the job start almost instantly.
I think it is due to something with the classloader. I use a huge number of
jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")..., which need to be
loaded every time, I guess.

By issuing conf.setNumTasksToExecutePerJvm(-1); will the TaskTracker child
live forever then ?

Cheers

//Marcus
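
[Editor's note: for reference, the tmpjars pattern Marcus describes usually looks something like the sketch below; the job class and jar paths are hypothetical. Every jar listed is shipped to the cluster on each job submission, which is where the per-job setup cost comes from.]

```java
JobConf conf = new JobConf(MyJob.class);
// Each listed jar is uploaded and distributed to the task nodes on every
// job submission; with many jars this can dominate job setup time.
conf.set("tmpjars",
    "file:///home/user/lib/jar1.jar,file:///home/user/lib/jar2.jar");
```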




Re: hadoop jobs take long time to setup

2009-06-28 Thread Stuart White
Although I've never done it, I believe you could manually copy your jar
files out to your cluster somewhere in Hadoop's classpath, and that would
remove the need to copy them to your cluster at the start of each
job.




Re: hadoop jobs take long time to setup

2009-06-28 Thread Mikhail Bautin
This is the way we deal with this problem, too. We put our jar files on NFS,
and the attached patch makes it possible to add those jar files to the
tasktracker classpath through a configuration property.

Thanks,
Mikhail


diff -rc java_original/org/apache/hadoop/mapred/TaskRunner.java java/org/apache/hadoop/mapred/TaskRunner.java
*** mapred_original/org/apache/hadoop/mapred/TaskRunner.java	2008-04-19 17:45:50.730243865 -0400
--- mapred/org/apache/hadoop/mapred/TaskRunner.java	2008-04-19 17:48:47.240302624 -0400
***************
*** 262,267 ****
--- 262,279 ----
  
        classPath.append(sep);
        classPath.append(workDir);
+ 
+       // Additional classpath specified by client (e.g. jar libraries
+       // stored in NFS).
+       {
+         String additionalClassPath =
+             conf.get("mapred.additional.class.path");
+         if (additionalClassPath != null) {
+           classPath.append(sep);
+           classPath.append(additionalClassPath);
+         }
+       }
+ 
        // Build exec child jvm args.
        Vector<String> vargs = new Vector<String>(8);
        File jvm =  // use same jvm as parent
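
[Editor's note: the classpath logic the patch adds can be sketched as a self-contained demo; the property name matches the patch, but the class, method, and paths below are illustrative stand-ins, not the patched TaskRunner itself.]

```java
import java.io.File;

public class ClasspathDemo {
    // Mirrors the patch: append the extra entries to the task classpath
    // only when the client has set the property.
    static String buildClassPath(String base, String additionalClassPath) {
        StringBuilder classPath = new StringBuilder(base);
        if (additionalClassPath != null) {
            classPath.append(File.pathSeparator);
            classPath.append(additionalClassPath);
        }
        return classPath.toString();
    }

    public static void main(String[] args) {
        // Hypothetical NFS jars, standing in for the value of
        // conf.get("mapred.additional.class.path").
        System.out.println(buildClassPath("job.jar", "/nfs/lib/a.jar:/nfs/lib/b.jar"));
        // Property unset: the classpath is left unchanged.
        System.out.println(buildClassPath("job.jar", null));
    }
}
```

On Unix the separator is ":", so the first call yields the base classpath with the NFS jars appended and the second returns the base classpath untouched.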


Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Makes sense... I will try both rsync and NFS, but I think rsync will beat NFS,
since NFS can be slow as hell sometimes. But what the heck, we already have
our maven2 repo on NFS, so why not :)

Are you saying that this patch makes the client able to configure which
extra local jar files to add to the classpath when firing up the
TaskTracker child ?

To be explicit: do you confirm that using tmpjars like I do is a costly,
slow operation ?

To what branch do you apply the patch (we use 0.18.3) ?

Cheers

//Marcus









Re: hadoop jobs take long time to setup

2009-06-28 Thread Mikhail Bautin
Marcus,

We currently use 0.20.0 but this patch just inserts 8 lines of code into
TaskRunner.java, which could certainly be done with 0.18.3.

Yes, this patch just appends additional jars to the child JVM classpath.

I've never really used tmpjars myself, but if it involves uploading multiple
jar files into HDFS every time a job is started, I see how it can be really
slow. On our ~80-job workflow this would have really slowed things down.

Thanks,
Mikhail





Re: hadoop jobs take long time to setup

2009-06-28 Thread Mikhail Bautin
Marcus,

The code that needs to be patched is in the tasktracker, because the
tasktracker is what starts the child JVM that runs user code.

Thanks,
Mikhail




Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Hi.

Just to be clear: is it the jobtracker that needs the patched code,
or is it the tasktrackers ?

Kindly

//Marcus



