[
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891824#action_12891824
]
Joydeep Sen Sarma commented on MAPREDUCE-1901:
----------------------------------------------
> The DistributedCache already tracks mtimes for files
ummmm - that's what i am saying. if u consider objects as immutable - then u
don't have to track and look up mtimes. part of the goal here is to not have to
look up mtimes again and again. if u have an object with matching md5 localized
- you are done. (but we can't use the names alone for that. names can collide.
md5 cannot (or nearly so). so we name objects based on their content signature
(md5) - which is what a content addressable store/cache does).
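A minimal sketch of the naming idea above, not the code in the attached patch. The helper names (content_key, localize) and the in-memory dict standing in for a node's local cache are assumptions made for illustration.

```python
import hashlib

def content_key(data: bytes) -> str:
    """Name an object by the MD5 hex digest of its bytes."""
    return hashlib.md5(data).hexdigest()

# Hypothetical stand-in for objects already localized on one node:
# content key -> bytes.
localized = {}

def localize(data: bytes) -> bool:
    """Return True if the object had to be fetched, False on a cache hit."""
    key = content_key(data)
    if key in localized:
        return False  # matching md5 already localized: done
    localized[key] = data
    return True

# Two different builds that share a file name, plus the first build
# reached through a different name.
build_1 = ("hive.jar", b"hive build 1")
build_2 = ("hive.jar", b"hive build 2 (altered)")
build_1_copy = ("my-hive.jar", b"hive build 1")

fetched = [localize(data) for _name, data in (build_1, build_2, build_1_copy)]
```

The file name plays no part in the lookup: the two objects called hive.jar get different keys, while the renamed copy of the first build is a cache hit.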
> Admin installs pig/hive on hdfs:
> /share/hive/v1/hive.jar
> /share/hive/v2/hive.jar
that's not how hive works (or how hadoop streaming works). people deploy hive
on NFS filers or local disks. users run hive jobs from these installation
points. there's no hdfs involvement anywhere. people add jars to hive or hadoop
streaming from their personal or shared folders. when people run hive jobs -
they are not writing java. there's no .setRemoteJar() code they are writing.
hive loads the required jars (from the install directory) to hadoop via hadoop
libjars/files/archives functionality. different hive clients are not aware of
each other (ditto for hadoop streaming). most of the hive clients are running
from common install points - but people may be running from personal install
points with altered builds.
with what we have done in this patch - all these uncoordinated clients
automatically share jars with each other. because the name for the shared
object now is derived from the content of the object. we are still leveraging
distributed cache - but we are naming objects based on their contents. Junjie
tells me we can leverage the 'shared' objects namespace from trunk (in 20 we
added our own shared namespace).
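The sharing between uncoordinated clients can be illustrated roughly as follows. SharedStore, put_if_absent, submit_job, and all paths below are hypothetical; the sketch only assumes that the shared namespace is keyed by content signature.

```python
import hashlib

class SharedStore:
    """Hypothetical stand-in for a shared namespace keyed by content."""

    def __init__(self):
        self.objects = {}  # content key -> bytes
        self.uploads = 0   # how many times bytes were actually copied

    def put_if_absent(self, data: bytes) -> str:
        key = hashlib.md5(data).hexdigest()
        if key not in self.objects:
            self.objects[key] = data
            self.uploads += 1
        return key

def submit_job(store: SharedStore, jars: dict) -> list:
    """Return the content keys a job would reference for its local jars."""
    return sorted(store.put_if_absent(data) for data in jars.values())

store = SharedStore()
stock = b"stock hive build"

# Clients do not know about each other; paths are made up for the example.
client_a = {"/nfs/hive/lib/hive.jar": stock}
client_b = {"/nfs/hive/lib/hive.jar": stock, "/home/u/udf.jar": b"custom udf"}
client_c = {"/home/v/hive/lib/hive.jar": b"altered hive build"}

keys_a = submit_job(store, client_a)
keys_b = submit_job(store, client_b)
keys_c = submit_job(store, client_c)
```

Four jars are submitted but only three uploads happen: the stock build is shared by clients a and b, and the altered build from the personal install point gets its own key.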
because the names are based on strong content signature - we can make the
assumption of immutability. as i have tried to point out many times - when
objects are immutable - one can make optimizations and skip timestamp based
validation. the latter requires hdfs lookups and creates load and latency.
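A rough comparison of the two validation strategies, under stated assumptions: FakeNameNode is a hypothetical stand-in that only counts metadata lookups, and the mtime and content key values are placeholders.

```python
class FakeNameNode:
    """Hypothetical stand-in that counts metadata lookups."""

    def __init__(self, mtimes):
        self.mtimes = dict(mtimes)
        self.lookups = 0

    def get_mtime(self, path):
        self.lookups += 1
        return self.mtimes[path]

def run_jobs_with_mtime_check(namenode, path, jobs):
    """Mutable names: every job must ask for the remote mtime."""
    local_mtime = None
    fetches = 0
    for _ in range(jobs):
        remote = namenode.get_mtime(path)  # one lookup per job
        if remote != local_mtime:
            local_mtime = remote
            fetches += 1
    return fetches

def run_jobs_with_content_key(key, jobs):
    """Immutable objects: a local key match needs no remote lookup."""
    local_keys = set()
    fetches = 0
    for _ in range(jobs):
        if key not in local_keys:
            local_keys.add(key)
            fetches += 1
    return fetches

path = "/share/hive/v1/hive.jar"
namenode = FakeNameNode({path: 1279000000})  # placeholder mtime
mtime_fetches = run_jobs_with_mtime_check(namenode, path, 100)
key_fetches = run_jobs_with_content_key("0123abcd", 100)  # placeholder key
```

Both strategies fetch the jar once, but the mtime check costs one lookup per job (100 here), while the content-keyed check needs none after the first localization.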
note that we need zero application changes for this sharing and zero admin
overhead. so all sorts of hadoop users will automatically start getting the
benefit of shared jars without writing any code and without any special admin
recipe.
isn't that good?
> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars
> and files (because they are using a framework like Hive/Pig) - the same jars
> keep getting uploaded and downloaded repeatedly. The overhead of this
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not
> submit the same files over and over again. Identifying and caching execution
> resources by a content signature (md5/sha) would be a good alternative to
> have available.