Re: hadoop jobs take long time to setup
Of course... Thanks for the help!

Cheers
//Marcus

On Mon, Jun 29, 2009 at 12:32 AM, Mikhail Bautin <mbau...@gmail.com> wrote:
> Marcus,
> The code that needs to be patched is in the tasktracker, because the
> tasktracker is what starts the child JVM that runs user code.
> Thanks,
> Mikhail

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
Re: hadoop jobs take long time to setup
How long does it take to start the code locally in a single thread?

Can you reuse the JVM so it only starts once per node per job?

conf.setNumTasksToExecutePerJvm(-1)

Cheers,
Tim

On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
> Hi.
> I wonder how one should improve the startup times of a Hadoop job. Some of
> my jobs, which have a lot of dependencies in terms of many jar files, take
> a long time to start in Hadoop, up to 2 minutes sometimes. The data input
> amounts in these cases are negligible, so it seems that Hadoop has a really
> high setup cost. I can live with some overhead, but this seems too much.
> Let's say a job takes 10 minutes to complete; then it is bad if it takes
> 2 minutes to set it up. 20-30 seconds max would be a lot more reasonable.
> Hints?
> //Marcus
>
> --
> Marcus Herou
> CTO and co-founder
> Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
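[Editor's note] Tim's JobConf call can also be expressed as a plain configuration property. A sketch, with the caveat that JVM reuse (and this property) only exists from Hadoop 0.19 onward, so it would not be available on the 0.18.3 branch mentioned later in the thread:

```xml
<!-- Per-job equivalent of conf.setNumTasksToExecutePerJvm(-1).
     A value of -1 reuses the child JVM for an unlimited number of
     tasks of the same job on a node; 1 (the default) starts a fresh
     JVM per task. Available from Hadoop 0.19 onward. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```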
Re: hadoop jobs take long time to setup
Hi.

Running without a jobtracker makes the job start almost instantly, so I
think it is due to something with the classloader. I use a huge number of
jar files:

  jobConf.set("tmpjars", "jar1.jar,jar2.jar")...

which need to be loaded every time, I guess.

By issuing conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker child
live forever then?

Cheers
//Marcus

On Sun, Jun 28, 2009 at 9:54 PM, tim robertson <timrobertson...@gmail.com> wrote:
> How long does it take to start the code locally in a single thread?
> Can you reuse the JVM so it only starts once per node per job?
> conf.setNumTasksToExecutePerJvm(-1)
> Cheers,
> Tim

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
Re: hadoop jobs take long time to setup
Although I've never done it, I believe you could manually copy your jar
files out to your cluster somewhere in Hadoop's classpath, and that would
remove the need for you to copy them to your cluster at the start of each
job.

On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
> Hi.
> Running without a jobtracker makes the job start almost instantly, so I
> think it is due to something with the classloader. I use a huge number of
> jar files: jobConf.set("tmpjars", "jar1.jar,jar2.jar")... which need to
> be loaded every time, I guess.
> By issuing conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker
> child live forever then?
> Cheers
> //Marcus
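[Editor's note] Stuart's approach can be sketched as a small deploy step. Everything below is illustrative: local temp directories stand in for the real source and destination paths, since those depend on the cluster layout. On a real cluster the copy would be an rsync or scp of the jars into something like $HADOOP_HOME/lib on every node, followed by a tasktracker restart so the new classpath takes effect.

```shell
#!/bin/sh
# Illustrative sketch only: pre-stage job dependency jars into a
# directory that is already on the tasktracker's classpath, so they no
# longer have to be shipped with every job submission.
set -e

SRC=$(mktemp -d)         # stand-in for the build's dependency jars
HADOOP_LIB=$(mktemp -d)  # stand-in for $HADOOP_HOME/lib on a node

# Pretend these are the job's dependencies.
touch "$SRC/dep-a.jar" "$SRC/dep-b.jar"

# On a real cluster this line would be something like:
#   rsync -av "$SRC"/ node:/opt/hadoop/lib/
cp "$SRC"/*.jar "$HADOOP_LIB"/

ls "$HADOOP_LIB"
```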
Re: hadoop jobs take long time to setup
This is the way we deal with this problem, too. We put our jar files on
NFS, and the attached patch makes it possible to add those jar files to
the tasktracker classpath through a configuration property.

Thanks,
Mikhail

On Sun, Jun 28, 2009 at 5:21 PM, Stuart White <stuart.whi...@gmail.com> wrote:
> Although I've never done it, I believe you could manually copy your jar
> files out to your cluster somewhere in Hadoop's classpath, and that would
> remove the need for you to copy them to your cluster at the start of
> each job.

diff -rc java_original/org/apache/hadoop/mapred/TaskRunner.java java/org/apache/hadoop/mapred/TaskRunner.java
*** mapred_original/org/apache/hadoop/mapred/TaskRunner.java	2008-04-19 17:45:50.730243865 -0400
--- mapred/org/apache/hadoop/mapred/TaskRunner.java	2008-04-19 17:48:47.240302624 -0400
***************
*** 262,267 ****
--- 262,279 ----
      classPath.append(sep);
      classPath.append(workDir);
+ 
+     // Additional classpath specified by client (e.g. jar libraries
+     // stored on NFS).
+     {
+       String additionalClassPath =
+           conf.get("mapred.additional.class.path");
+       if (additionalClassPath != null) {
+         classPath.append(sep);
+         classPath.append(additionalClassPath);
+       }
+     }
+ 
      // Build exec child jmv args.
      Vector<String> vargs = new Vector<String>(8);
      File jvm =                                  // use same jvm as parent
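[Editor's note] For reference, a sketch of how the property the patch reads could be set. The property name comes from the diff; the paths are hypothetical and assume the jars are mounted at the same NFS path on every tasktracker node:

```xml
<!-- Hypothetical example values; the jars must exist at the same
     (NFS-mounted) path on every node that runs tasks. -->
<property>
  <name>mapred.additional.class.path</name>
  <value>/nfs/share/jars/jar1.jar:/nfs/share/jars/jar2.jar</value>
</property>
```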
Re: hadoop jobs take long time to setup
Makes sense... I will try both rsync and NFS, but I think rsync will beat
NFS since NFS can be slow as hell sometimes. But what the heck, we already
have our maven2 repo on NFS, so why not :)

Are you saying that this patch makes the client able to configure which
extra local jar files to add to the classpath when firing up the
TaskTracker child?

To be explicit: do you confirm that using tmpjars like I do is a costly,
slow operation?

To what branch do you apply the patch (we use 0.18.3)?

Cheers
//Marcus

On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin <mbau...@gmail.com> wrote:
> This is the way we deal with this problem, too. We put our jar files on
> NFS, and the attached patch makes it possible to add those jar files to
> the tasktracker classpath through a configuration property.
> Thanks,
> Mikhail

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
Re: hadoop jobs take long time to setup
Marcus,

We currently use 0.20.0, but this patch just inserts 8 lines of code into
TaskRunner.java, which could certainly be done with 0.18.3. Yes, this
patch just appends additional jars to the child JVM classpath.

I've never really used tmpjars myself, but if it involves uploading
multiple jar files into HDFS every time a job is started, I see how it can
be really slow. On our ~80-job workflow this would have really slowed
things down.

Thanks,
Mikhail

On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
> Makes sense... I will try both rsync and NFS, but I think rsync will beat
> NFS since NFS can be slow as hell sometimes. But what the heck, we
> already have our maven2 repo on NFS, so why not :)
> Are you saying that this patch makes the client able to configure which
> extra local jar files to add to the classpath when firing up the
> TaskTracker child?
> To be explicit: do you confirm that using tmpjars like I do is a costly,
> slow operation?
> To what branch do you apply the patch (we use 0.18.3)?
> Cheers
> //Marcus
Re: hadoop jobs take long time to setup
Marcus,

The code that needs to be patched is in the tasktracker, because the
tasktracker is what starts the child JVM that runs user code.

Thanks,
Mikhail

On Sun, Jun 28, 2009 at 6:14 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
> Hi.
> Just to be clear: it is the jobtracker that needs the patched code,
> right? Or is it the tasktrackers?
> Kindly
> //Marcus
Re: hadoop jobs take long time to setup
Hi.

Just to be clear: it is the jobtracker that needs the patched code, right?
Or is it the tasktrackers?

Kindly
//Marcus

On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin <mbau...@gmail.com> wrote:
> Marcus,
> We currently use 0.20.0, but this patch just inserts 8 lines of code into
> TaskRunner.java, which could certainly be done with 0.18.3. Yes, this
> patch just appends additional jars to the child JVM classpath.
> I've never really used tmpjars myself, but if it involves uploading
> multiple jar files into HDFS every time a job is started, I see how it
> can be really slow. On our ~80-job workflow this would have really
> slowed things down.
> Thanks,
> Mikhail

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/