Re: hadoop jobs take long time to setup

Mikhail Bautin Sun, 28 Jun 2009 15:09:04 -0700

Marcus,

We currently use 0.20.0, but this patch just inserts 8 lines of code into
TaskRunner.java, which could certainly be done with 0.18.3 as well.


Yes, this patch just appends additional jars to the child JVM classpath.
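
To illustrate the idea (this is a rough sketch, not the actual patch): read a
comma-separated list of jar paths from a configuration property and append
them to the classpath string handed to the child JVM. The class name, the
method name, and the property name "example.child.extra.jars" are all made up
for illustration, and java.util.Properties stands in for the Hadoop
configuration object.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class ExtraClasspathSketch {
    // Hypothetical property name; NOT the one used by the real patch.
    static final String EXTRA_JARS_KEY = "example.child.extra.jars";

    /**
     * Returns baseClasspath followed by every jar listed (comma-separated)
     * under EXTRA_JARS_KEY, joined with the platform path separator.
     */
    static String buildChildClasspath(String baseClasspath, Properties conf) {
        List<String> entries = new ArrayList<>();
        if (baseClasspath != null && !baseClasspath.isEmpty()) {
            entries.add(baseClasspath);
        }
        String extra = conf.getProperty(EXTRA_JARS_KEY, "");
        for (String jar : extra.split(",")) {
            String trimmed = jar.trim();
            if (!trimmed.isEmpty()) {   // skip empty entries such as "a.jar,,b.jar"
                entries.add(trimmed);
            }
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Example paths on a shared mount; purely illustrative.
        conf.setProperty(EXTRA_JARS_KEY, "/mnt/nfs/lib/a.jar, /mnt/nfs/lib/b.jar");
        System.out.println(buildChildClasspath("job.jar", conf));
    }
}
```

Since the jars are only referenced by path, they have to exist at the same
location on every node (NFS mount or rsync'ed copy); nothing is uploaded at
job start.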

I've never really used tmpjars myself, but if it involves uploading multiple
jar files into HDFS every time a job is started, I see how it can be really
slow. On our ~80-job workflow this would have really slowed things down.

Thanks,
Mikhail

On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:

> Makes sense... I will try both rsync and NFS. I think rsync will beat NFS,
> since NFS can be slow as hell sometimes, but what the heck, we already have
> our maven2 repo on NFS, so why not :)
>
> Are you saying that this patch makes the client able to configure which
> "extra" local jar files to add to the classpath when firing up the
> TaskTrackerChild?
>
> To be explicit: can you confirm that using tmpjars like I do is a costly,
> slow operation?
>
> To what branch do you apply the patch (we use 0.18.3)?
>
> Cheers
>
> //Marcus
>
>
> On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin <mbau...@gmail.com>
> wrote:
>
> > This is the way we deal with this problem, too. We put our jar files on
> > NFS, and the attached patch makes it possible to add those jar files to the
> > tasktracker classpath through a configuration property.
> >
> > Thanks,
> > Mikhail
> >
> > On Sun, Jun 28, 2009 at 5:21 PM, Stuart White <stuart.whi...@gmail.com>
> > wrote:
> >
> >> Although I've never done it, I believe you could manually copy your jar
> >> files out to your cluster somewhere in hadoop's classpath, and that would
> >> remove the need for you to copy them to your cluster at the start of each
> >> job.
> >>
> >> On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou <marcus.he...@tailsweep.com>
> >> wrote:
> >>
> >> > Hi.
> >> >
> >> > Running without a jobtracker makes the job start almost instantly.
> >> > I think it is due to something with the classloader. I use a huge amount
> >> > of jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")..., which need
> >> > to be loaded every time, I guess.
> >> >
> >> > If I issue conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker
> >> > child live forever then?
> >> >
> >> > Cheers
> >> >
> >> > //Marcus
> >> >
> >> > On Sun, Jun 28, 2009 at 9:54 PM, tim robertson <timrobertson...@gmail.com>
> >> > wrote:
> >> >
> >> > > How long does it take to start the code locally in a single thread?
> >> > >
> >> > > Can you reuse the JVM so it only starts once per node per job?
> >> > > conf.setNumTasksToExecutePerJvm(-1)
> >> > >
> >> > > Cheers,
> >> > > Tim
> >> > >
> >> > >
> >> > >
> >> > > On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou <marcus.he...@tailsweep.com>
> >> > > wrote:
> >> > > > Hi.
> >> > > >
> >> > > > Wonder how one should improve the startup times of a hadoop job. Some
> >> > > > of my jobs, which have a lot of dependencies in terms of many jar
> >> > > > files, take a long time to start in hadoop, up to 2 minutes sometimes.
> >> > > > The data input amounts in these cases are negligible, so it seems that
> >> > > > Hadoop has a really high setup cost, which I can live with, but this
> >> > > > seems too much.
> >> > > >
> >> > > > Let's say a job takes 10 minutes to complete; then it is bad if it
> >> > > > takes 2 mins to set it up... 20-30 sec max would be a lot more
> >> > > > reasonable.
> >> > > >
> >> > > > Hints ?
> >> > > >
> >> > > > //Marcus
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Marcus Herou CTO and co-founder Tailsweep AB
> >> > > > +46702561312
> >> > > > marcus.he...@tailsweep.com
> >> > > > http://www.tailsweep.com/
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Marcus Herou CTO and co-founder Tailsweep AB
> >> > +46702561312
> >> > marcus.he...@tailsweep.com
> >> > http://www.tailsweep.com/
> >> >
> >>
> >
> >
>
>
