John,

If you are using Oozie, dropping all the JARs your MR jobs need in the Oozie WF lib/ directory should suffice. Oozie will make sure all those JARs are in the distributed cache.
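Concretely, the lib/ directory sits next to your workflow.xml in the workflow application directory on HDFS. A sketch of the layout (the app path and jar names here are just illustrative):

    hdfs://namenode/user/john/my-wf-app/
        workflow.xml
        lib/
            needed-1.0.jar
            any-other-dependency.jar

Everything under lib/ is placed on the distributed cache and on the classpath of the workflow's actions, so you don't have to call DistributedCache yourself.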
Alejandro

On Thu, May 26, 2011 at 7:45 AM, John Armstrong <john.armstr...@ccri.com> wrote:

> Hi, everybody.
>
> I'm running into some difficulties getting needed libraries to
> map/reduce tasks using the distributed cache.
>
> I'm using Hadoop 0.20.2, which from what I can tell is a hard
> requirement by the client, so more current versions are not really
> viable options.
>
> The code I've inherited is Java, which sets up and runs the MR job.
> There's currently some nontrivial pre- and post-processing, so it will
> be a large refactoring before I can just run bare MR jobs rather than
> starting them through Java.
>
> Further complicating matters: in practice the Java jobs are launched
> by Oozie, which of course does so by wrapping each one in an MR shell.
> The upshot is that I don't have any control over which "local"
> filesystem the Java job is run from, though if local files are
> absolutely needed I can make my Java wrappers copy stuff back from
> HDFS to the Java job's local filesystem.
>
> So here's the problem:
>
> The mappers and/or reducers need the class Needed, which is contained
> in needed-1.0.jar, which is in HDFS:
>
>     hdfs://.../libdir/distributed/needed-1.0.jar
>
> The Java program executes:
>
>     DistributedCache.addFileToClassPath(
>         new Path("hdfs://.../libdir/distributed/needed-1.0.jar"),
>         job.getConfiguration());
>
> Inspecting the Job object, I find the file has been added to the cache
> files as expected:
>
>     job.conf.overlay[...] = mapred.cache.files ->
>         hdfs://.../libdir/distributed/needed-1.0.jar
>     job.conf.properties[...] = mapred.cache.files ->
>         hdfs://.../libdir/distributed/needed-1.0.jar
>
> And the class seems to show up in the internal ClassLoader:
>
>     job.conf.classLoader.classes[...] = "class my.class.package.Needed"
>
> though this may just be inherited from the ClassLoader of the Java
> process itself (which also uses Needed).
>
> And yet as soon as I get into the mapreduce job itself I start
> getting:
>
>     2011-05-25 17:22:56,080 INFO JobClient - Task Id :
>     attempt_201105251330_0037_r_000043_0, Status : FAILED
>     java.lang.RuntimeException: java.lang.ClassNotFoundException:
>     my.class.package.Needed
>
> Up until this point we've run things by having a directory on each
> node containing all the libraries we'd need, and including that in the
> Hadoop classpath, but we have no such control in the deployment
> scenario, so we have to make our program hand the needed libraries to
> the map and reduce nodes via the distributed cache classpath.
>
> Thanks in advance for any insight or assistance you can offer.
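P.S. If you want to keep the direct DistributedCache route as a fallback, here is a minimal, self-contained sketch of the pattern you describe; the class name, jar path, and job wiring are placeholders, not taken from your code. Two things worth noting: Job copies the Configuration handed to its constructor, so the call has to go through job.getConfiguration() as you already do; and I've left the path scheme-less, since fully-qualified hdfs:// URIs have been reported to confuse the classpath matching on 0.20.x (I haven't verified that against 0.20.2 myself).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class SubmitWithCachedJar {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "job-needing-extra-jar");

            // Ship the jar (already in HDFS) to every task and put it on
            // the task classpath. This must be called on
            // job.getConfiguration(): Job copies the Configuration passed
            // to its constructor, so later changes to 'conf' itself are
            // not seen by the job. The path is deliberately scheme-less;
            // it resolves against the default filesystem (fs.default.name).
            DistributedCache.addFileToClassPath(
                    new Path("/libdir/distributed/needed-1.0.jar"),
                    job.getConfiguration());

            // ... set mapper/reducer classes, input/output paths, etc. ...

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }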