Finally got to test this. It reduced the startup time from 40+ secs to below
10 secs, which is great!
However, we modified the patch slightly; our version looks like this:
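    // Comma-separated list of jar files or directories that already exist on
    // the local filesystem of every tasktracker (e.g. an NFS mount).
    String additionalClassPath = conf.get("mapred.additional.class.path");
    if (additionalClassPath != null) {
      for (String localfile : additionalClassPath.split(",")) {
        localfile = localfile.trim();
        if (localfile.length() == 0) {
          continue;  // skip empty entries, e.g. from a doubled comma
        }
        LOG.info("Adding " + localfile);
        classPath.append(sep);
        classPath.append(localfile);
      }
    }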
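
For anyone who wants to try the parsing outside of the tasktracker, here is a
minimal stand-alone sketch of the same idea. The class name
AdditionalClassPath, the methods parse() and append(), and the jar paths are
made up for illustration; the real change lives in TaskRunner.java, where
classPath and sep already exist.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone helper, not part of Hadoop. It mirrors the
// snippet above so the splitting and trimming can be tried in isolation.
final class AdditionalClassPath {

    // Splits a comma-separated property value into trimmed, non-empty entries.
    static List<String> parse(String value) {
        List<String> entries = new ArrayList<>();
        if (value == null) {
            return entries;  // property not set: nothing to add
        }
        for (String raw : value.split(",")) {
            String entry = raw.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    // Appends each entry to an existing classpath, preceded by the separator.
    static String append(String classPath, String value, String sep) {
        StringBuilder sb = new StringBuilder(classPath);
        for (String entry : parse(value)) {
            sb.append(sep).append(entry);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder paths; use whatever is visible on every node.
        String sep = System.getProperty("path.separator");
        String value = " /mnt/nfs/lib/a.jar, /mnt/nfs/lib/b.jar ,";
        System.out.println(append("job.jar", value, sep));
    }
}
```

Keeping the parsing in its own method makes it easy to check that stray
whitespace and empty entries do not end up on the child JVM classpath.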
Cheers
//Marcus
On Mon, Jun 29, 2009 at 8:36 AM, Marcus Herou <[email protected]> wrote:
> Of course... Thanks for the help!
>
> Cheers
>
> //Marcus
>
>
> On Mon, Jun 29, 2009 at 12:32 AM, Mikhail Bautin <[email protected]>wrote:
>
>> Marcus,
>>
>> The code that needs to be patched is in the tasktracker, because the
>> tasktracker is what starts the child JVM that runs user code.
>>
>> Thanks,
>> Mikhail
>>
>> On Sun, Jun 28, 2009 at 6:14 PM, Marcus Herou <[email protected]
>> >wrote:
>>
>> > Hi.
>> >
>> > Just to be clear: it is the jobtracker that needs the patched code, right?
>> > Or is it the tasktrackers?
>> >
>> > Kindly
>> >
>> > //Marcus
>> >
>> > On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin <[email protected]>
>> > wrote:
>> >
>> > > Marcus,
>> > >
>> > > We currently use 0.20.0, but this patch just inserts 8 lines of code into
>> > > TaskRunner.java, which could certainly be done with 0.18.3.
>> > >
>> > > Yes, this patch just appends additional jars to the child JVM classpath.
>> > >
>> > > I've never really used tmpjars myself, but if it involves uploading multiple
>> > > jar files into HDFS every time a job is started, I see how it can be really
>> > > slow. On our ~80-job workflow this would have really slowed things down.
>> > >
>> > > Thanks,
>> > > Mikhail
>> > >
>> > > On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou <
>> > [email protected]
>> > > >wrote:
>> > >
>> > > > Makes sense... I will try both rsync and NFS, but I think rsync will beat
>> > > > NFS, since NFS can be slow as hell sometimes. But what the heck, we already
>> > > > have our maven2 repo on NFS, so why not :)
>> > > >
>> > > > Are you saying that this patch makes the client able to configure which
>> > > > "extra" local jar files to add to the classpath when firing up the
>> > > > TaskTrackerChild?
>> > > >
>> > > > To be explicit: do you confirm that using tmpjars like I do is a costly,
>> > > > slow operation?
>> > > >
>> > > > To what branch do you apply the patch (we use 0.18.3)?
>> > > >
>> > > > Cheers
>> > > >
>> > > > //Marcus
>> > > >
>> > > >
>> > > > On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin <[email protected]
>> >
>> > > > wrote:
>> > > >
>> > > > > This is the way we deal with this problem, too. We put our jar files on
>> > > > > NFS, and the attached patch makes it possible to add those jar files to
>> > > > > the tasktracker classpath through a configuration property.
>> > > > >
>> > > > > Thanks,
>> > > > > Mikhail
>> > > > >
>> > > > > On Sun, Jun 28, 2009 at 5:21 PM, Stuart White <
>> > [email protected]
>> > > > >wrote:
>> > > > >
>> > > > >> Although I've never done it, I believe you could manually copy your jar
>> > > > >> files out to your cluster somewhere in hadoop's classpath, and that would
>> > > > >> remove the need for you to copy them to your cluster at the start of each
>> > > > >> job.
>> > > > >>
>> > > > >> On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou <
>> > > > [email protected]
>> > > > >> >wrote:
>> > > > >>
>> > > > >> > Hi.
>> > > > >> >
>> > > > >> > Running without a jobtracker makes the job start almost instantly.
>> > > > >> > I think it is due to something with the classloader. I use a huge number
>> > > > >> > of jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")..., which need
>> > > > >> > to be loaded every time, I guess.
>> > > > >> >
>> > > > >> > If I issue conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker child
>> > > > >> > live forever then?
>> > > > >> >
>> > > > >> > Cheers
>> > > > >> >
>> > > > >> > //Marcus
>> > > > >> >
>> > > > >> > On Sun, Jun 28, 2009 at 9:54 PM, tim robertson <
>> > > > >> [email protected]
>> > > > >> > >wrote:
>> > > > >> >
>> > > > >> > > How long does it take to start the code locally in a single thread?
>> > > > >> > >
>> > > > >> > > Can you reuse the JVM so it only starts once per node per job?
>> > > > >> > > conf.setNumTasksToExecutePerJvm(-1)
>> > > > >> > >
>> > > > >> > > Cheers,
>> > > > >> > > Tim
>> > > > >> > >
>> > > > >> > >
>> > > > >> > >
>> > > > >> > > On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou<
>> > > > >> [email protected]
>> > > > >> > >
>> > > > >> > > wrote:
>> > > > >> > > > Hi.
>> > > > >> > > >
>> > > > >> > > > Wonder how one should improve the startup time of a hadoop job. Some of
>> > > > >> > > > my jobs, which depend on many jar files, take a long time to start in
>> > > > >> > > > hadoop, sometimes up to 2 minutes.
>> > > > >> > > > The data input amounts in these cases are negligible, so it seems that
>> > > > >> > > > Hadoop has a really high setup cost, which I can live with, but this
>> > > > >> > > > seems too much.
>> > > > >> > > >
>> > > > >> > > > Let's say a job takes 10 minutes to complete; then it is bad if it takes
>> > > > >> > > > 2 mins to set it up... 20-30 sec max would be a lot more reasonable.
>> > > > >> > > >
>> > > > >> > > > Hints ?
>> > > > >> > > >
>> > > > >> > > > //Marcus
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > > --
>> > > > >> > > > Marcus Herou CTO and co-founder Tailsweep AB
>> > > > >> > > > +46702561312
>> > > > >> > > > [email protected]
>> > > > >> > > > http://www.tailsweep.com/
>> > > > >> > > >
>> > > > >> > >
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > >
>> >
>>
>
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/