[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891824#action_12891824
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
----------------------------------------------

> The DistributedCache already tracks mtimes for files

ummmm - that's what i am saying. if you consider objects as immutable - then you 
don't have to track and look up mtimes. part of the goal here is to not have to 
look up mtimes again and again. if you have an object with a matching md5 
localized - you are done. (but we can't use the names alone for that. names can 
collide. md5s cannot (or nearly so). so we name objects based on their content 
signature (md5) - which is what a content addressable store/cache does.)
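
a minimal sketch of the naming idea, not the code in the patch: the helper `content_name` and the `.jar` suffix are made up for illustration. it only shows that identical bytes get the same name no matter where they came from, while an altered build with the same file name gets a different one.

```python
import hashlib


def content_name(data: bytes, suffix: str = ".jar") -> str:
    """Name an object by the md5 of its bytes (hypothetical naming scheme)."""
    return hashlib.md5(data).hexdigest() + suffix


# Two clients with identical jar bytes, loaded from different install points.
common_install = content_name(b"hive jar bytes, build A")
personal_copy = content_name(b"hive jar bytes, build A")

# Same file name on disk, but an altered build with different bytes.
altered_build = content_name(b"hive jar bytes, build B")

print(common_install == personal_copy)  # True: same content, same name
print(common_install == altered_build)  # False: different content, different name
```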

> Admin installs pig/hive on hdfs:
> /share/hive/v1/hive.jar
> /share/hive/v2/hive.jar

that's not how hive works (or how hadoop streaming works). people deploy hive 
on NFS filers or local disks. users run hive jobs from these installation 
points. there's no hdfs involvement anywhere. people add jars to hive or hadoop 
streaming from their personal or shared folders. when people run hive jobs - 
they are not writing java. there's no .setRemoteJar() code they are writing.

hive loads the required jars (from the install directory) to hadoop via hadoop 
libjars/files/archives functionality. different hive clients are not aware of 
each other (ditto for hadoop streaming). most of the hive clients are running 
from common install points - but people may be running from personal install 
points with altered builds.

with what we have done in this patch - all these uncoordinated clients 
automatically share jars with each other. because the name for the shared 
object now is derived from the content of the object. we are still leveraging 
distributed cache - but we are naming objects based on their contents. Junjie 
tells me we can leverage the 'shared' objects namespace from trunk (in 20 we 
added our own shared namespace).
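
to illustrate how uncoordinated clients end up sharing: a rough sketch using an in-memory dict as a stand-in for the shared namespace. `SharedStore` and `put_if_absent` are hypothetical names, not the distributed cache API or anything from the patch.

```python
import hashlib


class SharedStore:
    """In-memory stand-in for a shared namespace on hdfs (hypothetical)."""

    def __init__(self):
        self.objects = {}  # content-derived name -> bytes
        self.uploads = 0   # how many times bytes were actually stored

    def put_if_absent(self, data: bytes) -> str:
        """Store data under its md5 name unless that name already exists."""
        name = hashlib.md5(data).hexdigest()
        if name not in self.objects:
            self.objects[name] = data
            self.uploads += 1
        return name


store = SharedStore()
jar = b"contents of hive.jar"

name_a = store.put_if_absent(jar)                # client A, common install point
name_b = store.put_if_absent(jar)                # client B, unaware of client A
name_c = store.put_if_absent(b"altered build")   # client C, personal altered build

print(name_a == name_b)  # True: A and B share one object
print(store.uploads)     # 2: one upload for the common jar, one for the altered build
```

neither client needs to know about the other; the only coordination is the name, and the name comes from the bytes.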

because the names are based on strong content signature - we can make the 
assumption of immutability. as i have tried to point out many times - when 
objects are immutable - one can make optimizations and skip timestamp based 
validation. the latter requires hdfs lookups and creates load and latency.
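
a small sketch of the difference in lookups, under the assumption that a timestamp check costs one remote metadata call per job. `RemoteFs`, `is_fresh_by_mtime` and `is_fresh_by_digest` are illustrative names only, and the digest is a placeholder.

```python
class RemoteFs:
    """Stand-in for hdfs that counts metadata lookups (hypothetical)."""

    def __init__(self, mtimes):
        self.mtimes = dict(mtimes)
        self.lookups = 0

    def get_mtime(self, path):
        self.lookups += 1
        return self.mtimes[path]


def is_fresh_by_mtime(remote, local_mtimes, path):
    # Timestamp-based validation: every check asks the remote filesystem.
    return local_mtimes.get(path) == remote.get_mtime(path)


def is_fresh_by_digest(local_digests, digest):
    # Content-named objects are immutable: local presence is sufficient.
    return digest in local_digests


path = "/share/hive/v1/hive.jar"
digest = "0123456789abcdef0123456789abcdef"  # placeholder, not a real md5

remote = RemoteFs({path: 1000})
local_mtimes = {path: 1000}
local_digests = {digest}

# 100 jobs validating the same localized jar by timestamp.
mtime_hits = sum(is_fresh_by_mtime(remote, local_mtimes, path) for _ in range(100))
lookups_for_mtime = remote.lookups

# 100 jobs validating the same localized jar by content name.
digest_hits = sum(is_fresh_by_digest(local_digests, digest) for _ in range(100))
lookups_for_digest = remote.lookups - lookups_for_mtime

print(lookups_for_mtime)   # 100
print(lookups_for_digest)  # 0
```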

note that we need zero application changes for this sharing and zero admin 
overhead. so all sorts of hadoop users will automatically start getting the 
benefit of shared jars without writing any code and without any special admin 
recipe.

isn't that good?


> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and over again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
