Libjars works if your MR job is initialized correctly -- i.e., if your
main() actually runs the arguments through GenericOptionsParser. (Also:
if the space after the comma in your -libjars list below is really in
the command and not just mail wrapping, that alone will break it -- the
paths must be comma-separated with no whitespace.) Here's a code
snippet:

  import org.apache.hadoop.util.GenericOptionsParser;
  import org.apache.hadoop.util.ToolRunner;

  public static void main(String[] args) throws Exception {
    // GenericOptionsParser pulls the generic options (-libjars, -D,
    // -files, ...) out of args and applies them to the Configuration.
    GenericOptionsParser optParser = new GenericOptionsParser(args);
    int exitCode = ToolRunner.run(optParser.getConfiguration(),
        new MyMRJob(),
        optParser.getRemainingArgs());
    System.exit(exitCode);
  }
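
MyMRJob there has to implement Tool for ToolRunner to work with it -- a
minimal sketch (the class and job names are just placeholders):

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.util.Tool;

  public class MyMRJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
      // getConf() returns the Configuration ToolRunner was handed, so
      // it already carries the -libjars entries.
      Job job = new Job(getConf(), "my-mr-job");
      job.setJarByClass(MyMRJob.class);
      // ... set mapper/reducer classes and input/output paths here ...
      return job.waitForCompletion(true) ? 0 : 1;
    }
  }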

Pig works by re-jarring your whole application, and there's an
outstanding patch to make it use libjars instead -- which works; I've
been running it in production at Twitter.

-D

On Wed, Nov 16, 2011 at 9:00 AM, Something Something
<mailinglist...@gmail.com> wrote:
> I agree.  It will eventually get us in trouble.  That's why we want to get
> the -libjars option to work, but it's not working.. arrrghhh..  It's the
> simplest things in engineering that take the longest time... -:)
>
> Can you see why this may not work?
>
> /Users/xyz/hadoop-0.20.2/bin/hadoop jar
> /Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
> -libjars /Users/xyz/modules/something/target/my.jar,
> /Users/xyz/avro-tools-1.5.4.jar
>
> On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven
> <fvanvollenho...@xebia.com> wrote:
>>
>> Are you using Maven's default jar-with-dependencies assembly? That layout
>> works too, but it will eventually give you problems when two dependencies
>> ship different classes with the same package and name.
>>
>> Java jar files are regular ZIP files, and ZIP files can contain duplicate
>> entries. I don't know whether your packaging creates duplicates, but if it
>> does, that could be the cause of your problem.
>>
>> Try checking your jar for a duplicate license dir in META-INF
>> (something like: unzip -l <your-jar-name>.jar | awk '{print $4}' | sort |
>> uniq -d)
>>
>> Friso
>>
>> On 16 nov. 2011, at 17:33, Something Something wrote:
>>
>> Thanks Bejoy & Friso.  When I use the all-in-one jar file created by Maven
>> I get this:
>>
>> Mkdirs failed to create
>> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
>>
>>
>> Do you recall coming across this?  Our 'all-in-one' jar is not exactly what
>> you described.  It doesn't contain any JARs; it has all the classes from
>> all the dependent JARs instead.
>>
>>
>> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
>> <fvanvollenho...@xebia.com> wrote:
>>>
>>> We usually package our jobs as a single jar with a /lib directory inside
>>> that holds all the other jars the job code depends on. Hadoop understands
>>> this layout when run as 'hadoop jar'. So the jar layout would be something
>>> like:
>>> /META-INF/manifest.mf
>>> /com/mypackage/MyMapperClass.class
>>> /com/mypackage/MyReducerClass.class
>>> /lib/dependency1.jar
>>> /lib/dependency2.jar
>>> etc.
>>> If you use Maven or some other build tool with dependency management, you
>>> can usually produce this jar as part of your build. We also have Maven write
>>> the main class to the manifest, such that there is no need to type it. So
>>> for us, submitting a job looks like:
>>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
>>> Then Hadoop will take care of submitting and distributing, etc. Of course
>>> you pay the penalty of always sending all of your dependencies over the wire
>>> (the job jar gets replicated to 10 machines by default). Pre-distributing
>>> sounds tedious and error prone to me. What if you have different jobs that
>>> require different versions of the same dependency?
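>>>
>>> For reference, an assembly descriptor that produces this layout looks
>>> roughly like the following (a sketch from memory -- adjust ids, scopes
>>> and the plugin wiring to your own build):
>>>
>>> <assembly>
>>>   <id>job</id>
>>>   <formats>
>>>     <format>jar</format>
>>>   </formats>
>>>   <includeBaseDirectory>false</includeBaseDirectory>
>>>   <fileSets>
>>>     <fileSet>
>>>       <!-- the job's own classes go at the jar root -->
>>>       <directory>${project.build.outputDirectory}</directory>
>>>       <outputDirectory>/</outputDirectory>
>>>     </fileSet>
>>>   </fileSets>
>>>   <dependencySets>
>>>     <dependencySet>
>>>       <!-- all runtime dependencies end up under /lib -->
>>>       <outputDirectory>lib</outputDirectory>
>>>       <useProjectArtifact>false</useProjectArtifact>
>>>       <scope>runtime</scope>
>>>     </dependencySet>
>>>   </dependencySets>
>>> </assembly>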
>>>
>>> HTH,
>>> Friso
>>>
>>>
>>>
>>>
>>> On 16 nov. 2011, at 15:42, Something Something wrote:
>>>
>>> Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
>>> 'hadoop jar'.  Also, as per the documentation
>>> (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>>>
>>> Generic Options
>>>
>>> The following options are supported by dfsadmin, fs, fsck, job and fetchdt.
>>>
>>> Note that 'jar' is not in that list.
>>>
>>> Does it work for you?  If it does, please let me know.  "Pre-distributing"
>>> definitely works, but is that the best way?  If you have a big cluster and
>>> the jars change often, it will be time-consuming.
>>>
>>> Also, how does Pig do it?  We update Pig UDFs often and put them only on
>>> the 'client' machine (the machine that starts the Pig job), and the UDFs
>>> become available to all machines in the cluster -- automagically!  Is Pig
>>> doing the pre-distributing for us?
>>>
>>> Thanks for your patience & help with our questions.
>>>
>>> On Wed, Nov 16, 2011 at 6:29 AM, Something Something
>>> <mailinglist...@gmail.com> wrote:
>>>>
>>>> Hmm... there must be a different way 'cause we don't need to do that to
>>>> run Pig jobs.
>>>>
>>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <daan.ger...@gmail.com>
>>>> wrote:
>>>>>
>>>>> There might be different ways, but currently we store our jars on HDFS
>>>>> and register them from there. They get copied to the machines once the
>>>>> job starts. Is that an option?
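>>>>>
>>>>> For a plain MR job you can do the same thing in code -- something like
>>>>> this (just a sketch; the path is made up and conf is your job's
>>>>> Configuration):
>>>>>
>>>>>   import org.apache.hadoop.filecache.DistributedCache;
>>>>>   import org.apache.hadoop.fs.Path;
>>>>>
>>>>>   // Adds a jar that already sits on HDFS to the job's classpath
>>>>>   // via the distributed cache.
>>>>>   DistributedCache.addFileToClassPath(
>>>>>       new Path("/libs/my-udfs.jar"), conf);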
>>>>>
>>>>> Daan.
>>>>>
>>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>>>
>>>>> > Until now we were manually copying our jars to all machines in a Hadoop
>>>>> > cluster.  This used to work while our cluster was small.  Now our
>>>>> > cluster is getting bigger.  What's the best way to start a Hadoop job
>>>>> > that automatically distributes the jar to all machines in a cluster?
>>>>> >
>>>>> > I read the doc at:
>>>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>>>> >
>>>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for that,
>>>>> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
>>>>> >
>>>>> > Needless to say, we are getting our feet wet with Hadoop, so appreciate
>>>>> > your help with our dumb questions.
>>>>> >
>>>>> > Thanks.
>>>>> >
>>>>> > PS:  We use Pig a lot, which automatically does this, so there must be
>>>>> > a clean way to do this.
>>>>>
>>>>
>>>
>>>
>>
>>
>
>
