Thanks David. I've been trying to use DistributedCache, as it's now been suggested to me twice, but I'm afraid I'm just not getting it.
It appears I need to associate my use of DistributedCache.addFileToClassPath() with a specific JobConf instance. If that is the case, what does addFileToClassPath() give me that I don't already get with setJar()? The performance hit from using setJar() for every Job is huge, so I assume having to use addFileToClassPath() for every Job will be just as bad. I'm looking to add a jar to my Hadoop classpath just once and then use it for many different map/reduce jobs. Effectively, I'm trying to dynamically achieve the same result as hardcoding my jar file into HADOOP_CLASSPATH in hadoop-env.sh on every node in my system. I still can't see how to do this :(

On Sep 15, 2010, at 11:46 AM, David Rosenstrauch <[email protected]> wrote:

> On 09/14/2010 10:10 PM, Pete Tyler wrote:
>> I'm trying to figure out how to achieve the following from a Java client:
>> 1. My app (which is a web server) starts up
>> 2. As part of startup, my jar file, which includes my map reduce classes, is
>> distributed to the hadoop nodes
>> 3. My web app uses map reduce to extract data without the performance
>> overhead of each job deploying a jar file via setJar() / setJarByClass()
>>
>> It looks like DistributedCache has potential, but the need for commands
>> like 'hadoop fs -copyFromLocal ...' and API methods like
>> 'getLocalCacheArchives()' look to be at odds with my scenario. Any thoughts?
>>
>> -Peter
>
> For step 2, you have 2 options on how to implement:
> a) call DistributedCache.addFileToClassPath(jarFileURI, conf);
> b) have your app implement Tool, use ToolRunner to launch it, and specify a
> -libjars command line parm, which will achieve the same effect as in (a). See
> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/util/Tool.html
> and
> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/util/GenericOptionsParser.html#GenericOptions
> for details.
>
> HTH,
>
> DR
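P.S. For the archives, here is roughly what I understand option (a) to look like per job. This is only a sketch against the old org.apache.hadoop.mapred API (r0.20.x); the jar path /libs/myjob.jar and class names are made up for illustration, and it assumes the jar was already uploaded to HDFS (e.g. with 'hadoop fs -copyFromLocal myjob.jar /libs/myjob.jar'). Note the 0.20 signature takes a Path, not a URI.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWithCachedJar {
    public static void main(String[] args) throws Exception {
        // One JobConf per job -- this is the part I was hoping to avoid,
        // since the classpath addition below must be repeated for every job.
        JobConf conf = new JobConf(new Configuration(), SubmitWithCachedJar.class);

        // Adds the HDFS-resident jar to the task classpath for THIS job only.
        // The jar is not re-uploaded from the client; tasktrackers pull it
        // from HDFS via the DistributedCache.
        DistributedCache.addFileToClassPath(new Path("/libs/myjob.jar"), conf);

        // ... then set mapper/reducer, input/output paths, and submit, e.g.:
        // JobClient.runJob(conf);
    }
}
```

So the jar upload happens once, but the classpath wiring still appears to be per-JobConf, which is what prompted my question above.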
