[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

Joydeep Sen Sarma (JIRA) Thu, 22 Jul 2010 22:19:23 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891471#action_12891471
 ]


Joydeep Sen Sarma commented on MAPREDUCE-1901:
----------------------------------------------

thanks for taking a look. i think there are some differences (and potentially 
some overlap as well) with what we are trying to do here:

1. the jobclient in this approach computes md5 of jars/files/archives (when a 
special option is enabled) and then automatically submits these jars as shared 
objects by putting them in a global namespace - where the (md5, file-name) pair 
identifies the shared object, instead of (jobid, file-name, file-timestamp).

2. it treats shared objects as immutable, meaning that we never look up the 
timestamp of the backing object in hdfs during task localization/validation. 
this saves time during task setup. 

3. reasonable effort has been put into bypassing as many hdfs calls as possible 
in step 1. the client gets a listing of all shared objects and their md5 
signatures in one shot. because of the immutability assumption - individual 
file timestamps are never required, which saves hdfs calls.

4. finally - there is inbuilt code to do garbage collection of the shared 
namespace (in hdfs)  by deleting old shared objects that have not been recently 
accessed.
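
to make the four points above concrete, here is a minimal sketch in python. it is not the code from the patch (which lives in the java jobclient); every function name here is hypothetical, the "listing" is modelled as a plain set of (md5, file-name) keys, and the last-access times are modelled as a dict rather than real hdfs metadata.

```python
import hashlib
import time
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def shared_key(path):
    """Point 1: a shared object is identified by (md5, file-name)."""
    return (md5_of_file(path), Path(path).name)


def plan_submission(local_paths, shared_listing):
    """Split local resources into keys to reuse and keys to upload.

    shared_listing is a set of (md5, file-name) keys fetched once up front
    (point 3). No per-file timestamp is consulted, because shared objects
    are assumed immutable (point 2).
    """
    reuse, upload = [], []
    for path in local_paths:
        key = shared_key(path)
        (reuse if key in shared_listing else upload).append(key)
    return reuse, upload


def stale_objects(last_access, max_idle_seconds, now=None):
    """Point 4: return keys not accessed within max_idle_seconds.

    last_access maps (md5, file-name) -> last access time in seconds;
    how that time is tracked in hdfs is left out of this sketch.
    """
    now = time.time() if now is None else now
    return [key for key, t in last_access.items() if now - t > max_idle_seconds]
```

with this shape, two clients submitting byte-identical jars under the same name compute the same key, so only the first one uploads; a jar with the same name but different contents gets a different key and does not collide.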

so i believe the scope of this effort is somewhat different (based on looking 
at the last patch for 744).

the difference here is that all applications (like Hive) that use the libjars 
etc. options provided in hadoop automatically share jars with each other (when 
they set this option). the applications don't have to do anything special (like 
figuring out the right global identifier in hdfs for their jars).

Our primary use case is Hive. Hive submits multiple jars for each Hadoop 
job. Users can add more. At any given time - we have at least 4-5 official 
versions of Hive being used to submit jobs. in addition - hive developers are 
developing custom builds and submitting jobs using them. the total number of 
jobs submitted per day is in the tens of thousands.

with this patch - we automatically get sharing of jars and zero administration 
overhead of managing a global namespace amongst many versions of our software 
libraries. I believe there's nothing Hive specific here. We use hadoop jar/file 
resources just like hadoop-streaming and other map-reduce jobs.

before embarking on this venture - we looked at the hadoop code and tried to 
find out whether a similar facility existed. we noticed an md5 class - but no 
uses for it. if there is existing functionality to the above effect - we would 
love to pick it up (less work for us). otherwise - i think this is very useful 
functionality that would be good to have in the Hadoop framework.

if you can look at the patch a bit - that might help in understanding the 
differences as well. 

> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and over again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
