[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887614#action_12887614
 ] 

Junjie Liang commented on MAPREDUCE-1901:
-----------------------------------------

Here are some more details on how I intend to make the changes. Please comment 
and give suggestions as you see fit.

Currently, auxiliary files added through {{DistributedCache.addCacheFiles}} and 
{{DistributedCache.addCacheArchive}} end up in {mapred.system.dir}/job_id/files 
or {mapred.system.dir}/job_id/archives. The /job_id directory is then removed 
after every job, which is why files cannot be reused across jobs.

I'm proposing a change in the way files are stored in HDFS. Instead of storing 
files in /job_id/files or /job_id/archives, we store them directly in 
{mapred.system.dir}/files and {mapred.system.dir}/archives. This removes the 
association between a file and the job ID, so that files can persist across 
jobs.

Two new methods, {{DistributedCache.addSharedCacheFiles()}} and 
{{DistributedCache.addSharedCacheArchives()}}, are added so that users can add 
files that are shared across jobs. Files that are added through the original 
methods {{addCacheFiles()}} and {{addCacheArchive()}} are not affected; they 
go through the same code path as before.

The "shared" files are stored in {mapred.system.dir}/files and 
{mapred.system.dir}/archives (note the job_id is removed from the path). To 
prevent files with the same filename from colliding, the md5 checksum of each 
file is prepended to its filename, so for example, test.txt becomes 
ab876d86389d76c9e692fffd51bb2acde_test.txt. We use both the md5 checksum and 
the filename to identify a file, so there is no confusion between files with 
the same filename but different contents, or between files with the same 
contents but different filenames.
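
As a rough sketch of the naming scheme above (not the actual patch): the class 
and method names below are hypothetical and are not part of Hadoop. The sketch 
assumes the file contents fit in memory; a real implementation would more 
likely stream the file through the digest.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of Hadoop.
final class SharedCacheNames {
    private SharedCacheNames() {}

    /** Returns the MD5 digest of the given bytes as lowercase hex. */
    static String md5Hex(byte[] contents) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(contents);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to avoid sign extension, then pad to two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Builds the shared-cache name "<md5>_<filename>". */
    static String sharedName(String fileName, byte[] contents) {
        return md5Hex(contents) + "_" + fileName;
    }
}
```

With this scheme, two files named test.txt with different contents get 
different shared names, and so do two identical files with different names.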

The TaskRunner no longer needs to use timestamps to decide whether a file is up 
to date, since the file will have a different md5 checksum if it is modified.
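
A minimal sketch of how that check could look, assuming the tasktracker keeps 
the set of shared names it has already localized. The class, its methods, and 
the idea of passing the localized names as a set are assumptions made for 
illustration; they are not taken from TaskRunner.

```java
import java.util.Set;

// Hypothetical helper for illustration only; not part of Hadoop.
final class SharedCacheLookup {
    private SharedCacheLookup() {}

    /** Returns the md5 prefix of a shared name of the form "<md5>_<filename>". */
    static String digestOf(String sharedName) {
        return sharedName.substring(0, separator(sharedName));
    }

    /** Returns the original filename part of a shared name. */
    static String fileNameOf(String sharedName) {
        return sharedName.substring(separator(sharedName) + 1);
    }

    /**
     * A file needs to be downloaded only if no local copy carries exactly the
     * same md5-prefixed name. No timestamp comparison is involved: modified
     * contents produce a different prefix and therefore a different name.
     */
    static boolean needsDownload(Set<String> localizedNames, String sharedName) {
        return !localizedNames.contains(sharedName);
    }

    // The hex digest never contains '_', so the first '_' ends the prefix.
    private static int separator(String sharedName) {
        int i = sharedName.indexOf('_');
        if (i < 0) {
            throw new IllegalArgumentException("not a shared name: " + sharedName);
        }
        return i;
    }
}
```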

Files that need to be changed: JobClient.java, DistributedCache.java, and 
TaskRunner.java need the most changes, since files move from the client to 
HDFS to the tasktracker nodes through these classes.

Thanks!

> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and over again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

