[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891529#action_12891529
 ] 

Vinod K V commented on MAPREDUCE-1901:
--------------------------------------

bq. Currently, auxiliary files added through DistributedCache.addCacheFiles  
and DistributedCache.addCacheArchive end up in {mapred.system.dir}/job_id/files 
or {mapred.system.dir}/job_id/archives. The /job_id directory is then removed 
after every job, which is why files cannot be reused across jobs.
That is only true for private distributed cache files. Artifacts which are 
already public on the DFS don't go to mapredsystem directly at all and are 
reusable across users/jobs.

bq. 2. it treats shared objects as immutable. meaning that we never look up the 
timestamp of the backing object in hdfs during task localization/validation. 
this saves time during task setup.
bq. 3. reasonable effort has been put to bypass as many hdfs calls as possible 
in step 1. the client gets a listing of all shared objects and their md5 
signatures in one shot. because of the immutability assumption - individual 
file stamps are never required and save hdfs calls.
I think this is an orthogonal. If md5 checksums are preferred over timestamp 
based checks for the sake of lessening DFS accesses, that can be done 
separately within the current design, no? Distributed cache files originally 
did rely on md5 checksum of the files/jars that HDFS itself used to have. 
However that changed via HADOOP-1084 when checksums paved way for block level 
crcs.

bq. 4. finally - there is inbuilt code to do garbage collection of the shared 
namespace (in hdfs) by deleting old shared objects that have not been recently 
accessed.
This is where I think it gets tricky. First, garbage collection of the dfs 
namespace should be accompanied by the same on individual TTs - more complexity.

There are race conditions too. It's not clear how the JobTracker is prevented 
from expiring shared cache files/jars when some JobClient has already marked or 
is in the process of marking those artifacts for usage by the job. Warranting 
such synchronization across JobTracker and JobClients is difficult and, at 
best, brittle. Leaving the synchronization issues unsolved would only mean 
leaving the tasks/job to fail later which is not desirable.

bq. the difference here is that all applications (like Hive) using libjars etc. 
options provided in hadoop automatically share jars with each other (when they 
set this option). the applications don't have to do anything special (like 
figuring out the right global identifier in hdfs for their jars).
That seems like a valid use-case. But as I mentioned above, because of 
complexity and race conditions it seems like a wrong place to develop it. 

I think the core problem is trying to perform a service (sharing of files) that 
strictly belongs to the layer above mapreduce - maintaining the share list 
doesn't seem like a JT's responsibility. The current way of leaving it to the 
users to decide which are public files(and hence shareable) and which are not 
and how and when they are purged, keeps things saner from the mapreduce 
framework point of view. What do you think?

bq. if u can look at the patch a bit - that might help understand the 
differences as well
I looked at the patch. And I am still not convinced. Yet, that is.

> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversantly - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to