[
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891836#action_12891836
]
Junjie Liang commented on MAPREDUCE-1901:
-----------------------------------------
To supplement Joydeep's comment:
We are trying to reduce the number of calls to the NameNode through the
following optimizations:
1) Currently, files loaded through the hadoop libjars/files/archives mechanism
are copied onto HDFS and removed on every job. This is inefficient if most jobs
are submitted from only 3-4 versions of Hive, because the files should persist
in HDFS to be reused. Hence the idea of decoupling files from their jobId to
make them shareable across jobs.
2) If files are identified by their md5 checksums, we no longer need to
verify the file modification time in the TT (TaskTracker). This saves another
call to the NameNode to get the FileStatus object.
The reduction in the number of calls to the NameNode per job is small, but over
a large number of jobs we believe it will make a noticeable difference.
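To illustrate the idea (this is only a rough sketch, not the code in the
attached patch): the Python snippet below keys each resource by its md5
checksum instead of by jobId, and skips the copy when a file with the same
checksum is already in the shared location. The function names (md5_of,
publish) and the directory layout are made up for illustration, and a local
directory stands in for HDFS.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Tuple


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish(resource: Path, shared_root: Path) -> Tuple[Path, bool]:
    """Place a resource in the shared store unless the same content is there.

    shared_root is a local directory standing in for a shared HDFS
    directory (an assumption of this sketch). The target path depends only
    on the file's checksum and name, not on any jobId, so every job that
    ships the same jar resolves to the same path.

    Returns (shared_path, uploaded), where uploaded is False on reuse.
    """
    target = shared_root / md5_of(resource) / resource.name
    if target.exists():
        # Same checksum already stored: reuse it and skip the copy.
        return target, False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(resource, target)
    return target, True
```

Because the path itself encodes the checksum, a node that already holds a
local copy under that checksum has no need to compare modification times
against the shared copy, which is the saving described in point 2.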
> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in
> executing this job then download these resources to local disk.
> In an environment where most of the users are using a standard set of jars
> and files (because they are using a framework like Hive/Pig) - the same jars
> keep getting uploaded and downloaded repeatedly. The overhead of this
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - NameNode is under load (meaning HDFS latencies are high and made worse, in
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not
> submit the same files over and over again. Identifying and caching execution
> resources by a content signature (md5/sha) would be a good alternative to
> have available.