Hi,

This article describes the different ways of distributing third-party jars
with your application:

http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

Thanks,
Praveen

On Wed, Nov 16, 2011 at 11:30 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> Libjars works if your MR job is initialized correctly. Here's a code
> snippet:
>
>  import org.apache.hadoop.util.GenericOptionsParser;
>  import org.apache.hadoop.util.ToolRunner;
>
>  public static void main(String[] args) throws Exception {
>    // Pull the generic options (-libjars, -D, -files, ...) off the
>    // command line and into the Configuration before running the job.
>    GenericOptionsParser optParser = new GenericOptionsParser(args);
>    int exitCode = ToolRunner.run(optParser.getConfiguration(),
>        new MyMRJob(),
>        optParser.getRemainingArgs());
>    System.exit(exitCode);
>  }
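>
> For completeness, here's a minimal sketch (names and job wiring are
> illustrative, not from this thread) of what a Tool implementation like
> MyMRJob might look like. The key is building the Job from getConf(),
> the Configuration that GenericOptionsParser populated, so the -libjars
> entries carry through:
>
>  import org.apache.hadoop.conf.Configured;
>  import org.apache.hadoop.mapreduce.Job;
>  import org.apache.hadoop.util.Tool;
>
>  public class MyMRJob extends Configured implements Tool {
>    @Override
>    public int run(String[] args) throws Exception {
>      // getConf() returns the Configuration ToolRunner passed in,
>      // which already has the -libjars entries applied.
>      Job job = new Job(getConf(), "my job");
>      job.setJarByClass(MyMRJob.class);
>      // ... set mapper, reducer, input and output paths here ...
>      return job.waitForCompletion(true) ? 0 : 1;
>    }
>  }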
>
> Pig works by re-jarring your whole application, and there's an
> outstanding patch to make it use libjars -- which works; I've been
> running it in production at Twitter.
>
> -D
>
> On Wed, Nov 16, 2011 at 9:00 AM, Something Something
> <mailinglist...@gmail.com> wrote:
> > I agree.  It will eventually get us in trouble.  That's why we want to
> > get the -libjars option to work, but it's not working.. arrrghhh..  It's
> > the simplest things in engineering that take the longest time... :-)
> >
> > Can you see why this may not work?
> >
> > /Users/xyz/hadoop-0.20.2/bin/hadoop jar
> > /Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
> > -libjars /Users/xyz/modules/something/target/my.jar,
> > /Users/xyz/avro-tools-1.5.4.jar
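> >
> > (One thing worth double-checking, if the line break after the comma
> > above reflects a real space in the command: -libjars expects a single
> > comma-separated argument with no whitespace in between, i.e. something
> > like:)
> >
> > /Users/xyz/hadoop-0.20.2/bin/hadoop jar
> > /Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
> > -libjars /Users/xyz/modules/something/target/my.jar,/Users/xyz/avro-tools-1.5.4.jar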
> >
> > On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven
> > <fvanvollenho...@xebia.com> wrote:
> >>
> >> Do you use the Maven jar-with-deps default assembly? That layout works
> >> too, but it will eventually give you problems when you have different
> >> classes with the same package and name.
> >>
> >> Java jar files are regular ZIP files, so they can contain duplicate
> >> entries. I don't know whether your packaging creates duplicates, but if
> >> it does, it could be the cause of your problem. Try checking your jar
> >> for a duplicate license dir in the META-INF with something like:
> >>
> >>   unzip -l <your-jar-name>.jar | awk '{print $4}' | sort | uniq -d
> >>
> >> Friso
> >>
> >> On 16 Nov 2011, at 17:33, Something Something wrote:
> >>
> >> Thanks Bejoy & Friso.  When I use the all-in-one jar file created by
> >> Maven, I get this:
> >>
> >> Mkdirs failed to create
> >> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
> >>
> >> Do you recall coming across this?  Our 'all-in-one' jar is not exactly
> >> how you have described it.  It doesn't contain any JARs, but it has all
> >> the classes from all the dependent JARs.
> >>
> >>
> >> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
> >> <fvanvollenho...@xebia.com> wrote:
> >>>
> >>> We usually package our jobs as a single jar that contains a /lib
> >>> directory holding all the other jars that the job code depends on.
> >>> Hadoop understands this layout when run as 'hadoop jar'.  So the jar
> >>> layout would be something like:
> >>>
> >>> /META-INF/manifest.mf
> >>> /com/mypackage/MyMapperClass.class
> >>> /com/mypackage/MyReducerClass.class
> >>> /lib/dependency1.jar
> >>> /lib/dependency2.jar
> >>> etc.
> >>>
> >>> If you use Maven or some other build tool with dependency management,
> >>> you can usually produce this jar as part of your build.  We also have
> >>> Maven write the main class to the manifest, so that there is no need
> >>> to type it.  For us, submitting a job looks like:
> >>>
> >>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
> >>>
> >>> Hadoop will then take care of submitting and distributing, etc.  Of
> >>> course, you pay the penalty of always sending all of your dependencies
> >>> over the wire (the job jar gets replicated to 10 machines by default).
> >>> Pre-distributing sounds tedious and error-prone to me.  What if you
> >>> have different jobs that require different versions of the same
> >>> dependency?
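> >>>
> >>> (A side note, not from this thread: in the 0.20-era configuration that
> >>> replication factor is the mapred.submit.replication property, so it
> >>> can be tuned per job.  A minimal sketch, assuming a job driver that
> >>> builds its own Configuration:)
> >>>
> >>>   // Lower the number of HDFS replicas of the job jar (and other
> >>>   // submit-time files) from the default of 10 to 5.
> >>>   Configuration conf = new Configuration();
> >>>   conf.setInt("mapred.submit.replication", 5);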
> >>>
> >>> HTH,
> >>> Friso
> >>>
> >>>
> >>>
> >>>
> >>> On 16 Nov 2011, at 15:42, Something Something wrote:
> >>>
> >>> Bejoy - Thanks for the reply.  The '-libjars' option is not working
> >>> for me with 'hadoop jar'.  Also, as per the documentation
> >>> (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
> >>>
> >>> Generic Options
> >>>
> >>> The following options are supported by dfsadmin, fs, fsck, job and
> >>> fetchdt.
> >>>
> >>> Does it work for you?  If it does, please let me know.
> >>> "Pre-distributing" definitely works, but is that the best way?  If you
> >>> have a big cluster and jars that change often, it will be
> >>> time-consuming.
> >>>
> >>> Also, how does Pig do it?  We update Pig UDFs often and put them only
> >>> on the 'client' machine (the machine that starts the Pig job), and the
> >>> UDF becomes available to all machines in the cluster - automagically!
> >>> Is Pig doing the pre-distributing for us?
> >>>
> >>> Thanks for your patience & help with our questions.
> >>>
> >>> On Wed, Nov 16, 2011 at 6:29 AM, Something Something
> >>> <mailinglist...@gmail.com> wrote:
> >>>>
> >>>> Hmm... there must be a different way, 'cause we don't need to do
> >>>> that to run Pig jobs.
> >>>>
> >>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <daan.ger...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> There might be different ways, but currently we are storing our jars
> >>>>> on HDFS and registering them from there.  They will be copied to the
> >>>>> machine once the job starts.  Is that an option?
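> >>>>>
> >>>>> (For plain MapReduce jobs, a roughly equivalent trick -- a sketch
> >>>>> with an illustrative HDFS path and helper name, not something from
> >>>>> this thread -- is to add the HDFS-resident jar to the task classpath
> >>>>> via the distributed cache:)
> >>>>>
> >>>>>   import java.io.IOException;
> >>>>>
> >>>>>   import org.apache.hadoop.conf.Configuration;
> >>>>>   import org.apache.hadoop.filecache.DistributedCache;
> >>>>>   import org.apache.hadoop.fs.Path;
> >>>>>
> >>>>>   // Add a jar that already lives on HDFS to the job's classpath;
> >>>>>   // tasks pull it out of the distributed cache at startup.
> >>>>>   public static void addHdfsJar(Configuration conf) throws IOException {
> >>>>>     DistributedCache.addFileToClassPath(
> >>>>>         new Path("/libs/avro-tools-1.5.4.jar"), conf);
> >>>>>   }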
> >>>>>
> >>>>> Daan.
> >>>>>
> >>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
> >>>>>
> >>>>> > Until now we were manually copying our jars to all machines in the
> >>>>> > Hadoop cluster.  This worked while our cluster was small, but now
> >>>>> > our cluster is getting bigger.  What's the best way to start a
> >>>>> > Hadoop job that automatically distributes the jar to all machines
> >>>>> > in the cluster?
> >>>>> >
> >>>>> > I read the doc at:
> >>>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
> >>>>> >
> >>>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for
> >>>>> > that, right?  Until now, we were using 'hadoop jar' to start all
> >>>>> > our jobs.
> >>>>> >
> >>>>> > Needless to say, we are getting our feet wet with Hadoop, so we
> >>>>> > appreciate your help with our dumb questions.
> >>>>> >
> >>>>> > Thanks.
> >>>>> >
> >>>>> > PS:  We use Pig a lot, which does this automatically, so there must
> >>>>> > be a clean way to do it.
