Re: hadoop jobs take long time to setup

Marcus Herou Sun, 28 Jun 2009 23:37:22 -0700

Of course... Thanks for the help!

Cheers


//Marcus

On Mon, Jun 29, 2009 at 12:32 AM, Mikhail Bautin <mbau...@gmail.com> wrote:

> Marcus,
>
> The code that needs to be patched is in the tasktracker, because the
> tasktracker is what starts the child JVM that runs user code.
>
> Thanks,
> Mikhail
>
> On Sun, Jun 28, 2009 at 6:14 PM, Marcus Herou <marcus.he...@tailsweep.com
> >wrote:
>
> > Hi.
> >
> > Just to be clear: it is the jobtracker that needs the patched code, right?
> > Or is it the tasktrackers?
> >
> > Kindly
> >
> > //Marcus
> >
> > On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin <mbau...@gmail.com>
> > wrote:
> >
> > > Marcus,
> > >
> > > We currently use 0.20.0, but this patch just inserts 8 lines of code into
> > > TaskRunner.java, which could certainly be done with 0.18.3.
> > >
> > > Yes, this patch just appends additional jars to the child JVM classpath.
> > >
> > > I've never really used tmpjars myself, but if it involves uploading
> > > multiple jar files into HDFS every time a job is started, I can see how
> > > it can be really slow. On our ~80-job workflow this would have really
> > > slowed things down.
> > >
> > > Thanks,
> > > Mikhail
> > >
> > > On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou <
> > marcus.he...@tailsweep.com
> > > >wrote:
> > >
> > > > Makes sense... I will try both rsync and NFS, but I think rsync will
> > > > beat NFS since NFS can be slow as hell sometimes. But what the heck, we
> > > > already have our maven2 repo on NFS, so why not :)
> > > >
> > > > Are you saying that this patch makes the client able to configure which
> > > > "extra" local jar files to add to the classpath when firing up the
> > > > TaskTrackerChild?
> > > >
> > > > To be explicit: do you confirm that using tmpjars like I do is a costly,
> > > > slow operation?
> > > >
> > > > To what branch do you apply the patch (we use 0.18.3)?
> > > >
> > > > Cheers
> > > >
> > > > //Marcus
> > > >
> > > >
> > > > On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin <mbau...@gmail.com>
> > > > wrote:
> > > >
> > > > > This is the way we deal with this problem, too. We put our jar files
> > > > > on NFS, and the attached patch makes it possible to add those jar
> > > > > files to the tasktracker classpath through a configuration property.
> > > > >
> > > > > Thanks,
> > > > > Mikhail
> > > > >
> > > > > On Sun, Jun 28, 2009 at 5:21 PM, Stuart White <
> > stuart.whi...@gmail.com
> > > > >wrote:
> > > > >
> > > > >> Although I've never done it, I believe you could manually copy your
> > > > >> jar files out to your cluster somewhere in hadoop's classpath, and
> > > > >> that would remove the need for you to copy them to your cluster at
> > > > >> the start of each job.
> > > > >>
> > > > >> On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou <
> > > > marcus.he...@tailsweep.com
> > > > >> >wrote:
> > > > >>
> > > > >> > Hi.
> > > > >> >
> > > > >> > Running without a jobtracker makes the job start almost instantly.
> > > > >> > I think it is due to something with the classloader. I use a huge
> > > > >> > number of jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")...,
> > > > >> > which need to be loaded every time, I guess.
> > > > >> >
> > > > >> > If I issue conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker
> > > > >> > child live forever then?
> > > > >> >
> > > > >> > Cheers
> > > > >> >
> > > > >> > //Marcus
> > > > >> >
> > > > >> > On Sun, Jun 28, 2009 at 9:54 PM, tim robertson <
> > > > >> timrobertson...@gmail.com
> > > > >> > >wrote:
> > > > >> >
> > > > >> > > How long does it take to start the code locally in a single thread?
> > > > >> > >
> > > > >> > > Can you reuse the JVM so it only starts once per node per job?
> > > > >> > > conf.setNumTasksToExecutePerJvm(-1)
> > > > >> > >
> > > > >> > > Cheers,
> > > > >> > > Tim
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou<
> > > > >> marcus.he...@tailsweep.com
> > > > >> > >
> > > > >> > > wrote:
> > > > >> > > > Hi.
> > > > >> > > >
> > > > >> > > > I wonder how one should improve the startup times of a hadoop
> > > > >> > > > job. Some of my jobs, which have a lot of dependencies in terms
> > > > >> > > > of many jar files, take a long time to start in hadoop, up to
> > > > >> > > > 2 minutes sometimes.
> > > > >> > > > The data input amounts in these cases are negligible, so it
> > > > >> > > > seems that Hadoop has a really high setup cost, which I can live
> > > > >> > > > with, but this seems too much.
> > > > >> > > >
> > > > >> > > > Let's say a job takes 10 minutes to complete; then it is bad if
> > > > >> > > > it takes 2 mins to set it up... 20-30 sec max would be a lot
> > > > >> > > > more reasonable.
> > > > >> > > >
> > > > >> > > > Hints?
> > > > >> > > >
> > > > >> > > > //Marcus
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > > Marcus Herou CTO and co-founder Tailsweep AB
> > > > >> > > > +46702561312
> > > > >> > > > marcus.he...@tailsweep.com
> > > > >> > > > http://www.tailsweep.com/

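For anyone following the thread: the patch itself is not reproduced here, but the idea Mikhail describes (appending extra jars, read from a configuration property, to the child JVM classpath) can be sketched roughly as below. This is only an illustration under assumptions: the class name, the method name and the property name "example.child.extra.classpath" are made up, and the real change lives in TaskRunner.java and reads from the job configuration rather than a system property.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ChildClasspathSketch {
    // Hypothetical property name; the property used by the actual patch
    // is not given in the thread.
    static final String EXTRA_JARS_PROPERTY = "example.child.extra.classpath";

    // Appends a comma-separated list of jar paths (e.g. jars on an NFS mount)
    // to an existing classpath string, using the platform path separator.
    static String appendExtraJars(String baseClasspath, String extraJars) {
        List<String> entries = new ArrayList<>();
        if (baseClasspath != null && !baseClasspath.isEmpty()) {
            entries.add(baseClasspath);
        }
        if (extraJars != null) {
            for (String jar : extraJars.split(",")) {
                String trimmed = jar.trim();
                if (!trimmed.isEmpty()) {
                    entries.add(trimmed); // skip empty entries such as "a.jar,,b.jar"
                }
            }
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // Stand-ins for the tasktracker's own classpath and the configured value.
        String base = System.getProperty("java.class.path");
        String extra = System.getProperty(EXTRA_JARS_PROPERTY, "");
        System.out.println(appendExtraJars(base, extra));
    }
}
```

Because the jars are already present on every node (via NFS or rsync), nothing has to be uploaded to HDFS at job submission, which is where the tmpjars approach was suspected to lose time.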


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
