[
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891529#action_12891529
]
Vinod K V commented on MAPREDUCE-1901:
--------------------------------------
bq. Currently, auxiliary files added through DistributedCache.addCacheFiles
and DistributedCache.addCacheArchive end up in {mapred.system.dir}/job_id/files
or {mapred.system.dir}/job_id/archives. The /job_id directory is then removed
after every job, which is why files cannot be reused across jobs.
That is only true for private distributed cache files. Artifacts that are
already public on the DFS don't go to the mapred system directory at all and
are reusable across users/jobs.
bq. 2. it treats shared objects as immutable. meaning that we never look up the
timestamp of the backing object in hdfs during task localization/validation.
this saves time during task setup.
bq. 3. reasonable effort has been put to bypass as many hdfs calls as possible
in step 1. the client gets a listing of all shared objects and their md5
signatures in one shot. because of the immutability assumption - individual
file stamps are never required and save hdfs calls.
I think this is an orthogonal issue. If md5 checksums are preferred over
timestamp-based checks for the sake of reducing DFS accesses, that can be done
separately within the current design, no? Distributed cache files originally
did rely on the md5 checksum of the files/jars that HDFS itself used to have.
However, that changed via HADOOP-1084 when those checksums gave way to
block-level CRCs.
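To make the comparison concrete, here is a minimal Python sketch (not Hadoop
code) of the two validation styles being discussed. A local file stands in for
the DFS object, and the function names is_fresh_by_timestamp and
is_fresh_by_md5 are hypothetical, chosen only for illustration.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_fresh_by_timestamp(backing_path, recorded_mtime_ns):
    """Timestamp check: one metadata lookup on the backing store per validation."""
    return os.stat(backing_path).st_mtime_ns == recorded_mtime_ns


def is_fresh_by_md5(local_copy, expected_md5):
    """Checksum check: only the localized copy is read, not the backing store."""
    return md5_of_file(local_copy) == expected_md5
```

The timestamp variant costs a metadata call against the backing store on every
validation, while the checksum variant trades that call for local hashing work.
Either could, in principle, be swapped in without changing where the files live.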
bq. 4. finally - there is inbuilt code to do garbage collection of the shared
namespace (in hdfs) by deleting old shared objects that have not been recently
accessed.
This is where I think it gets tricky. First, garbage collection of the dfs
namespace should be accompanied by the same on individual TTs - more complexity.
There are race conditions too. It's not clear how the JobTracker is prevented
from expiring shared cache files/jars when some JobClient has already marked or
is in the process of marking those artifacts for usage by the job. Guaranteeing
such synchronization across JobTracker and JobClients is difficult and, at
best, brittle. Leaving the synchronization issues unsolved would only mean
leaving the tasks/job to fail later, which is not desirable.
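The interleaving I am worried about can be shown with a toy model. This is only
a sketch under stated assumptions: SharedCache and submit_with_race are
hypothetical names, time is a logical counter, and the steps run sequentially
in one process rather than across a real JobTracker and JobClient.

```python
class SharedCache:
    """Toy stand-in for a shared namespace with last-access based expiry."""

    def __init__(self):
        self.last_access = {}  # signature -> logical time of last access

    def put(self, signature, now):
        self.last_access[signature] = now

    def listing(self):
        """Snapshot of signatures present, as a client would fetch in one shot."""
        return set(self.last_access)

    def expire(self, now, max_idle):
        """Garbage-collect entries idle for longer than max_idle."""
        stale = [s for s, t in self.last_access.items() if now - t > max_idle]
        for sig in stale:
            del self.last_access[sig]

    def localize(self, signature):
        """What a task does later; fails if the entry was collected meanwhile."""
        if signature not in self.last_access:
            raise FileNotFoundError(signature)
        return signature


def submit_with_race():
    cache = SharedCache()
    cache.put("abc123", now=0)
    seen = cache.listing()              # client sees the jar and skips the upload
    cache.expire(now=100, max_idle=50)  # collector runs before usage is marked
    try:
        for sig in seen:
            cache.localize(sig)         # task setup: the jar is already gone
    except FileNotFoundError:
        return "task failed"
    return "ok"
```

Nothing in the client's snapshot tells it that the entry is about to expire, so
the failure only surfaces at task localization time.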
bq. the difference here is that all applications (like Hive) using libjars etc.
options provided in hadoop automatically share jars with each other (when they
set this option). the applications don't have to do anything special (like
figuring out the right global identifier in hdfs for their jars).
That seems like a valid use-case. But as I mentioned above, because of the
complexity and race conditions, this seems like the wrong place to develop it.
I think the core problem is trying to perform a service (sharing of files) that
strictly belongs to the layer above mapreduce - maintaining the share list
doesn't seem like a JT's responsibility. The current way of leaving it to the
users to decide which files are public (and hence shareable) and which are not,
and how and when they are purged, keeps things saner from the mapreduce
framework point of view. What do you think?
bq. if u can look at the patch a bit - that might help understand the
differences as well
I looked at the patch. And I am still not convinced. Yet, that is.
> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars
> and files (because they are using a framework like Hive/Pig) - the same jars
> keep getting uploaded and downloaded repeatedly. The overhead of this
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not
> submit the same files over and over again. Identifying and caching execution
> resources by a content signature (md5/sha) would be a good alternative to
> have available.
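
For reference, the content-signature idea in the description above can be
sketched as follows. This is a minimal illustration and not the attached patch:
submit_resources and content_signature are hypothetical names, a plain dict
stands in for the shared HDFS location, and SHA-256 is an assumed choice where
the description only says md5/sha.

```python
import hashlib


def content_signature(data: bytes) -> str:
    """Hex digest used as the content-derived identifier of a resource."""
    return hashlib.sha256(data).hexdigest()


def submit_resources(resources, shared_store):
    """Upload only resources whose signature is not already in shared_store.

    resources: dict mapping resource name -> bytes
    shared_store: dict mapping signature -> bytes (stand-in for shared storage)
    Returns (dict of name -> signature, number of uploads performed).
    """
    uploads = 0
    refs = {}
    for name, data in resources.items():
        sig = content_signature(data)
        if sig not in shared_store:
            shared_store[sig] = data  # first submitter pays the upload cost
            uploads += 1
        refs[name] = sig              # the job refers to the resource by signature
    return refs, uploads
```

A second job submitting byte-identical jars would perform zero uploads in this
model, which is the latency saving the description is after.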