-libjars works if your MR job is initialized correctly. Here's a code snippet:
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.ToolRunner;

public static void main(String[] args) throws Exception {
    // GenericOptionsParser strips the generic options (-libjars, -D, -conf, ...)
    // and applies them to the Configuration, leaving only the job's own args.
    GenericOptionsParser optParser = new GenericOptionsParser(args);
    int exitCode = ToolRunner.run(optParser.getConfiguration(),
        new MyMRJob(), optParser.getRemainingArgs());
    System.exit(exitCode);
}

Pig works by re-jarring your whole application, and there's an outstanding patch to make it run libjars -- which works; I've been running it in production at Twitter.

-D

On Wed, Nov 16, 2011 at 9:00 AM, Something Something <mailinglist...@gmail.com> wrote:
> I agree. It will eventually get us in trouble. That's why we want to get
> the -libjars option to work, but it's not working.. arrrghhh.. It's the
> simplest things in engineering that take the longest time... -:)
>
> Can you see why this may not work?
>
> /Users/xyz/hadoop-0.20.2/bin/hadoop jar
> /Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
> -libjars /Users/xyz/modules/something/target/my.jar,
> /Users/xyz/avro-tools-1.5.4.jar
>
> On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven
> <fvanvollenho...@xebia.com> wrote:
>>
>> Are you using the Maven jar-with-deps default assembly? That layout works
>> too, but it will give you problems eventually when you have different
>> classes with the same package and name.
>>
>> Java jar files are regular ZIP files. They can contain duplicate entries.
>> I don't know whether your packaging creates duplicates in them, but if it
>> does, it could be the cause of your problem. Try checking your jar for a
>> duplicate license dir in the META-INF (something like:
>> unzip -l <your-jar-name>.jar | awk '{print $4}' | sort | uniq -d)
>>
>> Friso
>>
>> On 16 nov. 2011, at 17:33, Something Something wrote:
>>
>> Thanks Bejoy & Friso. When I use the all-in-one jar file created by Maven,
>> I get this:
>>
>> Mkdirs failed to create
>> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
>>
>> Do you recall coming across this? Our 'all-in-one' jar is not exactly how
>> you have described it. It doesn't contain any JARs, but it has all the
>> classes from all the dependent JARs.
>>
>> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
>> <fvanvollenho...@xebia.com> wrote:
>>>
>>> We usually package our jobs as a single jar that contains a /lib
>>> directory holding all the other jars that the job code depends on.
>>> Hadoop understands this layout when run as 'hadoop jar'. So the jar
>>> layout would be something like:
>>>
>>> /META-INF/manifest.mf
>>> /com/mypackage/MyMapperClass.class
>>> /com/mypackage/MyReducerClass.class
>>> /lib/dependency1.jar
>>> /lib/dependency2.jar
>>> etc.
>>>
>>> If you use Maven or some other build tool with dependency management,
>>> you can usually produce this jar as part of your build. We also have
>>> Maven write the main class to the manifest, so that there is no need to
>>> type it. So for us, submitting a job looks like:
>>>
>>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
>>>
>>> Then Hadoop will take care of submitting and distributing, etc. Of
>>> course you pay the penalty of always sending all of your dependencies
>>> over the wire (the job jar gets replicated to 10 machines by default).
>>> Pre-distributing sounds tedious and error-prone to me. What if you have
>>> different jobs that require different versions of the same dependency?
>>>
>>> HTH,
>>> Friso
>>>
>>> On 16 nov. 2011, at 15:42, Something Something wrote:
>>>
>>> Bejoy - Thanks for the reply. The '-libjars' option is not working for
>>> me with 'hadoop jar'.
>>> Also, as per the documentation
>>> (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>>>
>>> Generic Options
>>>
>>> The following options are supported by dfsadmin, fs, fsck, job and
>>> fetchdt.
>>>
>>> Does it work for you? If it does, please let me know. "Pre-distributing"
>>> definitely works, but is that the best way? If you have a big cluster
>>> and the jars change often, it will be time-consuming.
>>>
>>> Also, how does Pig do it? We update Pig UDFs often and put them only on
>>> the 'client' machine (the machine that starts the Pig job), and the UDF
>>> becomes available to all machines in the cluster - automagically! Is Pig
>>> doing the pre-distributing for us?
>>>
>>> Thanks for your patience & help with our questions.
>>>
>>> On Wed, Nov 16, 2011 at 6:29 AM, Something Something
>>> <mailinglist...@gmail.com> wrote:
>>>>
>>>> Hmm... there must be a different way 'cause we don't need to do that
>>>> to run Pig jobs.
>>>>
>>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <daan.ger...@gmail.com>
>>>> wrote:
>>>>>
>>>>> There might be different ways, but currently we are storing our jars
>>>>> on HDFS and registering them from there. They will be copied to the
>>>>> machine once the job starts. Is that an option?
>>>>>
>>>>> Daan.
>>>>>
>>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>>>
>>>>> > Until now we were manually copying our jars to all machines in a
>>>>> > Hadoop cluster. This worked while our cluster was small, but now
>>>>> > our cluster is getting bigger. What's the best way to start a
>>>>> > Hadoop job that automatically distributes the jar to all machines
>>>>> > in the cluster?
>>>>> >
>>>>> > I read the doc at:
>>>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>>>> >
>>>>> > Would -libjars do the trick? But we need to use 'hadoop job' for
>>>>> > that, right? Until now, we were using 'hadoop jar' to start all
>>>>> > our jobs.
>>>>> >
>>>>> > Needless to say, we are getting our feet wet with Hadoop, so we
>>>>> > appreciate your help with our dumb questions.
>>>>> >
>>>>> > Thanks.
>>>>> >
>>>>> > PS: We use Pig a lot, which automatically does this, so there must
>>>>> > be a clean way to do it.
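
For completeness, here is a minimal sketch of what the MyMRJob class
referenced in the first snippet might look like, assuming the standard
Configured/Tool pattern that makes -libjars work under 'hadoop jar'. The
class name comes from the snippet above, but the job name and job wiring
are illustrative placeholders, not taken from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyMRJob extends Configured implements Tool {

    // run() receives only the non-generic arguments; ToolRunner has already
    // consumed -libjars, -D, -conf, etc. and applied them to getConf().
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "my-mr-job");  // job name is a placeholder
        job.setJarByClass(MyMRJob.class);
        // ... set mapper, reducer, input/output formats and paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner runs GenericOptionsParser itself, so the main() from the
        // first snippet in this thread would also work.
        System.exit(ToolRunner.run(new Configuration(), new MyMRJob(), args));
    }
}

One likely reason the command quoted near the top of the thread fails even
with a ToolRunner-based job: the -libjars value must be a single
comma-separated shell argument with no spaces, e.g.
-libjars /path/a.jar,/path/b.jar. With a space after the comma, the shell
passes the avro-tools jar as a separate argument, so it never reaches
GenericOptionsParser.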