I don't see a clear solution in that mailing thread: simply keeping a TaskTracker.Child running longer won't solve the problem nicely, because tasks from different jobs need different classpaths, and I guess this is only supported in later versions of hadoop.
One simple way to go is to add the jars to hadoop-env.sh (which will add them to the TaskTracker's classpath). This is not an elegant solution, but it gives us the full performance gain no matter which hadoop version we are using. I think a better solution would be to add an option "mapred.local.classpath" to JobConf, which specifies the path of the jars on the machines in the cluster. This would have to be done on the hadoop side, at the beginning of the main function in TaskTracker.Child (if TaskTracker.Child is reused, then we need to reset the classpath each time it runs a new task).

What do you think?

Zheng

On Thu, Jul 30, 2009 at 11:54 AM, Edward Capriolo <[email protected]> wrote:
> On Fri, Jul 24, 2009 at 1:45 PM, Edward Capriolo <[email protected]> wrote:
>> On Fri, Jul 24, 2009 at 1:36 PM, Zheng Shao <[email protected]> wrote:
>>> Hive only needs to be installed on the node that runs the hive query.
>>> All the jars will be sent to the hadoop JobClient via -libjars. The
>>> code is in ExecDriver.java.
>>>
>>> In hadoop 0.17, I don't think there is a way to add a path to the
>>> classpath for a job (unless we put it in hadoop-env.sh and start the
>>> TaskTracker with that path). Are there any changes in the later
>>> versions?
>>>
>>> Zheng
>>>
>>> On 7/24/09, Edward Capriolo <[email protected]> wrote:
>>>> I have been following some threads on the hadoop mailing list about
>>>> speeding up MR jobs. I have a few questions I am sure I could answer
>>>> by digging into the source code, but I thought I could get a quick
>>>> answer here.
>>>>
>>>> 1. ADD JAR 'myfile.jar' uses the distributed cache, which has some
>>>> overhead. I know that if I create an auxlib directory under the hive
>>>> root, those jars will be added to -libjars on startup. If I add my
>>>> jar to auxlib on all my nodes, will a UDF in the jar be available
>>>> during subsequent jobs?
>>>> Or is it only necessary to add those jars to auxlib on the node I
>>>> start the job from?
>>>>
>>>> 2. Dealing with the entire hive install: how much of the hive
>>>> install really needs to be replicated on each datanode? If we used
>>>> the distributed cache for everything, the jobs would have unneeded
>>>> overhead, but hive would be 'installed on demand' from the client.
>>>>
>>>> Thanks,
>>>> Edward
>>>
>>> --
>>> Sent from Gmail for mobile | mobile.google.com
>>>
>>> Yours,
>>> Zheng
>>
>> Zheng,
>>
>> A thread from the hadoop list piqued my interest. Search for
>> "hadoop jobs take long time to setup":
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%[email protected]%3e
>>
>> Can hive benefit?
>> Edward
>
> Could we use something like this for a performance increase? With the
> assumption that the jars are present on all task-trackers, could we
> have an alternate invocation script such as bin/hive-local?
>
> Edward

--
Yours,
Zheng
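[Editor's note] The hadoop-env.sh workaround mentioned at the top of this message could be sketched roughly as follows. The jar path is hypothetical; the real mechanism is the HADOOP_CLASSPATH variable that hadoop-env.sh exports before the daemons start.

```shell
# Hypothetical jar path; this line would go in conf/hadoop-env.sh on
# every node in the cluster. HADOOP_CLASSPATH is appended to the Java
# classpath of the hadoop daemons when they start up.
HIVE_AUX_JAR="/usr/local/hive/auxlib/my-udfs.jar"
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$HIVE_AUX_JAR"
```

As the message notes, this avoids shipping the jars through the distributed cache for every job, at the cost of having to deploy (and redeploy) the jars on every node by hand.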
