I agree.  It will eventually get us in trouble.  That's why we want to get
the -libjars option to work, but it's not working... arrrghhh...  It's the
simplest things in engineering that take the longest time... :-)

Can you see why this may not work?

/Users/xyz/hadoop-0.20.2/bin/hadoop jar
/Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
-libjars /Users/xyz/modules/something/target/my.jar,
/Users/xyz/avro-tools-1.5.4.jar
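
(For reference, a minimal sketch of a Tool-based driver. The class name
matches the command above; everything else is illustrative, not our real
code. -libjars is only honored when the main class runs through
ToolRunner/GenericOptionsParser, and the value after -libjars must be a
single comma-separated token, with no spaces or line breaks in the list.)

package com.xyz.common;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyMapReduce extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // getConf() already reflects whatever -libjars and friends set up.
        Job job = new Job(getConf(), "my-job");
        job.setJarByClass(MyMapReduce.class);
        // ... configure mapper/reducer/input/output from args ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner invokes GenericOptionsParser, which strips -libjars
        // and ships the listed jars to the cluster with the job.
        System.exit(ToolRunner.run(new Configuration(), new MyMapReduce(), args));
    }
}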


On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven <
fvanvollenho...@xebia.com> wrote:

>  Are you using Maven's default jar-with-dependencies assembly? That layout
> works too, but it will eventually give you problems when you have different
> classes with the same package and name.
>
>  Java jar files are regular ZIP files. They can contain duplicate
> entries. I don't know whether your packaging creates duplicates in them,
> but if it does, it could be the cause of your problem.
>
>  Try checking your jar for duplicate entries, e.g. a duplicate license dir
> in META-INF (something like: unzip -l <your-jar-name>.jar | awk '{print $4}'
> | sort | uniq -d)
>
>
>  Friso
>
>
>  On 16 nov. 2011, at 17:33, Something Something wrote:
>
> Thanks Bejoy & Friso.  When I use the all-in-one jar file created by Maven,
> I get this:
>
> Mkdirs failed to create
> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
>
>
> Do you recall coming across this?  Our 'all-in-one' jar is not laid out
> exactly the way you described: it doesn't contain any JARs, but it has all
> the classes from all the dependent JARs.
>
>
> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven <
> fvanvollenho...@xebia.com> wrote:
>
>> We usually package our jobs as a single jar containing a /lib directory
>> that holds all the other jars the job code depends on. Hadoop understands
>> this layout when the job is run with 'hadoop jar'. So the jar layout would
>> be something like:
>>
>> /META-INF/MANIFEST.MF
>> /com/mypackage/MyMapperClass.class
>> /com/mypackage/MyReducerClass.class
>> /lib/dependency1.jar
>> /lib/dependency2.jar
>> etc.
>>
>>  If you use Maven or some other build tool with dependency management, you
>> can usually produce this jar as part of your build. We also have Maven
>> write the main class to the manifest, so that there is no need to type it
>> on the command line. So for us, submitting a job looks like:
>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
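>>
>>  In case it helps, a sketch of a maven-assembly-plugin descriptor that
>> produces that layout (the id and the excluded artifact are illustrative;
>> adjust them to your project):
>>
>> <assembly>
>>   <id>job</id>
>>   <formats>
>>     <format>jar</format>
>>   </formats>
>>   <includeBaseDirectory>false</includeBaseDirectory>
>>   <dependencySets>
>>     <!-- keep every runtime dependency as a jar under /lib -->
>>     <dependencySet>
>>       <unpack>false</unpack>
>>       <scope>runtime</scope>
>>       <outputDirectory>lib</outputDirectory>
>>       <!-- Hadoop itself is already on the cluster -->
>>       <excludes>
>>         <exclude>org.apache.hadoop:hadoop-core</exclude>
>>       </excludes>
>>     </dependencySet>
>>   </dependencySets>
>>   <fileSets>
>>     <!-- our own compiled classes go at the root of the jar -->
>>     <fileSet>
>>       <directory>${project.build.outputDirectory}</directory>
>>       <outputDirectory>/</outputDirectory>
>>     </fileSet>
>>   </fileSets>
>> </assembly>
>>
>> The main class can be written to the manifest via the assembly plugin's
>> <archive><manifest><mainClass> configuration in the POM.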
>>
>>  Then Hadoop will take care of submitting and distributing, etc. Of
>> course you pay the penalty of always sending all of your dependencies over
>> the wire (the job jar gets replicated to 10 machines by
>> default). Pre-distributing sounds tedious and error-prone to me. What if
>> you have different jobs that require different versions of the same
>> dependency?
>>
>>
>>  HTH,
>> Friso
>>
>>  On 16 nov. 2011, at 15:42, Something Something wrote:
>>
>> Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
>> 'hadoop jar'.  Also, as per the documentation (
>> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>>
>>  Generic Options
>>
>> The following options are supported by dfsadmin, fs, fsck, job and fetchdt.
>>
>> Does it work for you?  If it does, please let me know.
>>  "Pre-distributing" definitely works, but is that the best way?  If you
>> have a big cluster and the jars change often, it will be time-consuming.
>>
>> Also, how does Pig do it?  We update Pig UDFs often and put them only on
>> the 'client' machine (the machine that starts the Pig job), and the UDF
>> becomes available to all machines in the cluster, automagically!  Is Pig
>> doing the pre-distributing for us?
>>
>> Thanks for your patience & help with our questions.
>>
>>  On Wed, Nov 16, 2011 at 6:29 AM, Something Something <
>> mailinglist...@gmail.com> wrote:
>>
>>> Hmm... there must be a different way 'cause we don't need to do that to
>>> run Pig jobs.
>>>
>>>
>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <daan.ger...@gmail.com> wrote:
>>>
>>>> There might be different ways, but currently we store our jars on HDFS
>>>> and register them from there. They will be copied to the machines once
>>>> the job starts. Is that an option?
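>>>>
>>>> A sketch of what that looks like in driver code on 0.20 (the HDFS path
>>>> and class name are made up; the registration goes through
>>>> DistributedCache):
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.filecache.DistributedCache;
>>>> import org.apache.hadoop.fs.Path;
>>>>
>>>> public class RegisterJarFromHdfs {
>>>>     public static void main(String[] args) throws Exception {
>>>>         Configuration conf = new Configuration();
>>>>         // The jar was uploaded beforehand, e.g.:
>>>>         //   hadoop fs -put my-udfs.jar /libs/my-udfs.jar
>>>>         // This puts it on every task's classpath when the job starts.
>>>>         DistributedCache.addFileToClassPath(new Path("/libs/my-udfs.jar"), conf);
>>>>         // ... then build and submit the Job with this conf ...
>>>>     }
>>>> }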
>>>>
>>>> Daan.
>>>>
>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>>
>>>> > Until now we were manually copying our jars to all machines in a
>>>> > Hadoop cluster.  This worked while our cluster was small, but now our
>>>> > cluster is getting bigger.  What's the best way to start a Hadoop job
>>>> > that automatically distributes the jar to all machines in a cluster?
>>>> >
>>>> > I read the doc at:
>>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>>> >
>>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for
>>>> > that, right?  Until now, we were using 'hadoop jar' to start all our
>>>> > jobs.
>>>> >
>>>> > Needless to say, we are getting our feet wet with Hadoop, so we
>>>> > appreciate your help with our dumb questions.
>>>> >
>>>> > Thanks.
>>>> >
>>>> > PS:  We use Pig a lot, which does this automatically, so there must
>>>> > be a clean way to do this.
>>>>
>>>>
>>>
>>
>>
>
>
