[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892206#action_12892206
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
----------------------------------------------

sorry for the endless confusion - i will try to write up a detailed doc 
tomorrow covering use cases and design/gaps etc.

the use case involves libjars being added from local file systems (since that's 
where software packages are deployed). in certain cases it's really not possible 
to deploy software packages on hdfs: we wish to execute the software locally 
without interacting with hdfs at all (see for example HIVE-1408). 

the changes to distributed cache (of which there are few - i think most 
changes are in jobclient and taskrunner) are concerned with allowing the 
assumption that the shared objects are immutable (in which case mtime checks 
can be bypassed). 
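
a minimal sketch of the immutability idea, in python for brevity. this is not 
the code in 1901.PATCH; the names CachedResource and is_cache_entry_valid are 
hypothetical and only illustrate how an "immutable" flag lets the mtime 
comparison be skipped.

```python
import os
from dataclasses import dataclass


@dataclass
class CachedResource:
    """Hypothetical record of one localized resource (not a Hadoop class)."""
    source_path: str        # where the resource was originally read from
    cached_mtime: float     # modification time recorded when it was cached
    immutable: bool = False  # assumption: the submitter promises it never changes


def is_cache_entry_valid(entry: CachedResource) -> bool:
    """Return True if the cached copy can be reused."""
    if entry.immutable:
        # Shared objects assumed immutable: no need to stat the source at all.
        return True
    # Mutable resources fall back to comparing modification times.
    return os.path.getmtime(entry.source_path) == entry.cached_mtime
```

for an immutable entry the source is never touched, so the check costs no 
file system round trip; every other entry still pays for one stat call.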

> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and over again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.
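
The content-signature idea in the description above can be sketched roughly as 
follows, in Python for brevity. This is an illustration under stated 
assumptions, not the attached patch: a local directory stands in for the shared 
HDFS location, SHA-256 is picked as one possible signature (the description 
only says md5/sha), and the function names content_signature and 
publish_resource are hypothetical.

```python
import hashlib
import os
import shutil


def content_signature(path, chunk_size=1 << 20):
    """Hash the file contents in chunks so large jars are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish_resource(local_path, cache_root):
    """Copy local_path into cache_root/<signature>/ unless it is already there.

    cache_root is a local directory standing in for a shared HDFS directory.
    Returns (cached_path, uploaded), where uploaded is False on a cache hit.
    """
    signature = content_signature(local_path)
    target_dir = os.path.join(cache_root, signature)
    target = os.path.join(target_dir, os.path.basename(local_path))
    if os.path.exists(target):
        # Same name and same bytes were published before: skip the upload.
        return target, False
    os.makedirs(target_dir, exist_ok=True)
    shutil.copyfile(local_path, target)
    return target, True
```

Because the location is derived from the file contents, a second job 
submitting an identical jar resolves to the same path and skips the copy, and a 
changed jar gets a new path rather than overwriting the old one.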

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
