Hi
      You can find usage examples of -libjars and -files at the following
Apache URL:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Usage

*"Running wordcount example with -libjars, -files and -archives:
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars
mylib.jar -archives myarchive.zip input output Here, myarchive.zip will be
placed and unzipped into a directory by the name "myarchive.zip". "*

I have implemented some MapReduce projects successfully with the -files
option and never faced any issues. I have used the -libjars option with
SQOOP as well, and there too it worked flawlessly.

If your jars keep changing often, then -libjars would be the preferred
option. If they are fairly static and there are a larger number of
dependent jars, it would be better to pre-distribute the dependent jars
explicitly across the nodes one time; that way the JobTracker doesn't need
to ship the jars to the TaskTracker nodes every time you submit a job.
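
For the -libjars route, a rough sketch of a submission (the jar, class and
path names below are just placeholders, and as far as I know the driver has
to go through ToolRunner/GenericOptionsParser for the generic options to be
picked up):

  # illustrative only: driver class implements Tool and is run via ToolRunner
  hadoop jar myjob.jar com.example.MyDriver \
      -libjars mylib1.jar,mylib2.jar \
      -files lookup.txt \
      /user/me/input /user/me/output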

I'm not entirely sure how it works for Hive and Pig. I believe Pig and Hive
parse the Pig Latin/HiveQL into MapReduce jobs, and maybe those jobs are
packaged and distributed across the nodes, though I'm really not sure.
 Experts, please correct me if I'm wrong.
AFAIK shipping jars/files across the cluster should be handled internally,
using something like the -libjars/-files mechanism, in pretty much all the
tools that use MapReduce under the hood.
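
That said, from the client side both tools do let you point at extra jars
when you launch them (Pig also has a REGISTER statement and Hive an ADD JAR
command for the same purpose). A rough sketch, with purely illustrative
local paths:

  # illustrative only: pass extra UDF jars from the client machine
  pig -Dpig.additional.jars=/local/path/myudfs.jar myscript.pig
  hive --auxpath /local/path/myudfs.jar -f myquery.hql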

Regards
Bejoy.K.S

On Wed, Nov 16, 2011 at 8:12 PM, Something Something <
mailinglist...@gmail.com> wrote:

> Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
> 'hadoop jar'.  Also, as per the documentation (
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>
> Generic Options
>
> The following options are supported by
> dfsadmin<http://hadoop.apache.org/common/docs/current/commands_manual.html#dfsadmin>,
> fs<http://hadoop.apache.org/common/docs/current/commands_manual.html#fs>,
> fsck<http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck>,
> job<http://hadoop.apache.org/common/docs/current/commands_manual.html#job> and
> fetchdt<http://hadoop.apache.org/common/docs/current/commands_manual.html#fetchdt>.
>
>
>
> Does it work for you?  If it does, please let me know.  "Pre-distributing"
> definitely works, but is that the best way?  If you have a big cluster and
> Jars are changing often it will be time-consuming.
>
> Also, how does Pig do it?  We update Pig UDFs often and put them only on
> the 'client' machine (machine that starts the Pig job) and the UDF becomes
> available to all machines in the cluster - automagically!  Is Pig doing the
> pre-distributing for us?
>
> Thanks for your patience & help with our questions.
>
> On Wed, Nov 16, 2011 at 6:29 AM, Something Something <
> mailinglist...@gmail.com> wrote:
>
>> Hmm... there must be a different way 'cause we don't need to do that to
>> run Pig jobs.
>>
>>
>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <daan.ger...@gmail.com> wrote:
>>
>>> There might be different ways but currently we are storing our jars onto
>>> HDFS and register them from there. They will be copied to the machine once
>>> the job starts. Is that an option?
>>>
>>> Daan.
>>>
>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>
>>> > Until now we were manually copying our Jars to all machines in a Hadoop
>>> > cluster.  This used to work until our cluster size was small.  Now our
>>> > cluster is getting bigger.  What's the best way to start a Hadoop Job
>>> that
>>> > automatically distributes the Jar to all machines in a cluster?
>>> >
>>> > I read the doc at:
>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>> >
>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for that,
>>> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
>>> >
>>> > Needless to say, we are getting our feet wet with Hadoop, so appreciate
>>> > your help with our dumb questions.
>>> >
>>> > Thanks.
>>> >
>>> > PS:  We use Pig a lot, which automatically does this, so there must be
>>> a
>>> > clean way to do this.
>>>
>>>
>>
>
