[
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891674#action_12891674
]
Joydeep Sen Sarma commented on MAPREDUCE-1901:
----------------------------------------------
@Arun - you are right - this is a layer above distributed cache for the most
part. Take a look at our use case (bottom of my previous comments). Essentially
we are extending the Distributed Cache a bit to be a content addressable cache.
I do not think our use case is directly supported by Hadoop for this purpose -
and we are hoping to make the change in the framework (instead of Hive) because
there's nothing Hive specific here and whatever we are doing will be directly
leveraged by other apps.
Sharing != Content addressable. An NFS filer can be globally shared - but it's
not content addressable. An EMC Centera (amongst others) is. Sorry - terrible
examples - trying to come up with something quickly.
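
To make the distinction concrete, here is a minimal Python sketch. It is
illustrative only and is not Hadoop code: `content_key` and the two
dictionaries are hypothetical names, and SHA-256 is an assumed choice of
digest. A shared store keyed by path treats the same bytes under two names as
two entries, while a content addressable store keyed by a digest of the bytes
collapses them into one.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Return a hex SHA-256 digest used as the content address (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the bytes of a jar that two users both ship with their jobs.
jar_bytes = b"pretend these are the bytes of a jar file"

# Path-addressed (merely shared) store: the key is whatever name the user chose,
# so identical content stored under two paths occupies two entries.
shared_store = {}
shared_store["/user/alice/hive-exec.jar"] = jar_bytes
shared_store["/user/bob/hive-exec.jar"] = jar_bytes

# Content-addressed store: the key is derived from the bytes themselves,
# so storing the same content twice lands on the same key.
content_store = {}
content_store[content_key(jar_bytes)] = jar_bytes
content_store[content_key(jar_bytes)] = jar_bytes  # same key, no new entry
```

Both stores are globally visible in this sketch; only the second one can tell
that the two jars are the same without comparing them byte for byte.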
Will address Vinod's comments later - we have taken race considerations into
account.
> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars
> and files (because they are using a framework like Hive/Pig) - the same jars
> keep getting uploaded and downloaded repeatedly. The overhead of this
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning HDFS latencies are high and made worse, in
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not
> submit the same files over and over again. Identifying and caching execution
> resources by a content signature (md5/sha) would be a good alternative to
> have available.
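
The idea in the description above can be sketched roughly as follows. This is
a toy, in-memory Python model and not the attached patch or any Hadoop API:
`ContentAddressedCache`, `put_if_absent`, and `submit_job` are hypothetical
names, and SHA-256 is an assumed stand-in for the "md5/sha" signature. Each
job computes a digest per resource and uploads only when that digest is not
already cached.

```python
import hashlib


class ContentAddressedCache:
    """Toy in-memory stand-in for a shared resource cache (hypothetical)."""

    def __init__(self):
        self._blobs = {}   # digest -> resource bytes
        self.uploads = 0   # number of times bytes were actually stored

    def put_if_absent(self, data: bytes) -> str:
        """Store data under its digest unless already present; return the digest."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = data
            self.uploads += 1
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch a resource by its content signature."""
        return self._blobs[digest]


def submit_job(cache: ContentAddressedCache, resources: list) -> list:
    """Return the digests a job would reference instead of per-job uploads."""
    return [cache.put_if_absent(r) for r in resources]


cache = ContentAddressedCache()
framework_jar = b"bytes of a jar shared by most jobs"
user_udf_jar = b"bytes of one user's own jar"

job1 = submit_job(cache, [framework_jar, user_udf_jar])
job2 = submit_job(cache, [framework_jar])  # nothing new to upload
```

After the two submissions `cache.uploads` is 2 rather than 3, because the
second job reuses the framework jar already in the cache. The check-then-store
step here is not atomic, so a real shared cache would need to handle
concurrent submitters.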