Support for caching Job JARs 
-----------------------------

                 Key: HADOOP-1032
                 URL: https://issues.apache.org/jira/browse/HADOOP-1032
             Project: Hadoop
          Issue Type: New Feature
          Components: mapred
            Reporter: Gautam Kowshik
            Priority: Minor


Jobs often need to be rerun a number of times, for example a job that reads 
from crawled data time and again, so having to upload the job jars to every 
node on each run is cumbersome. We need a caching mechanism to boost 
performance. Here are the proposed features for job-specific caching of 
jar/conf files:

 - Ability to resubmit jobs with jars without having to propagate the same jar 
to all nodes.
    The idea is to keep a store (path specified by the user in job.xml?) local 
to the task node so as to speed up task initiation on tasktrackers. This 
assumes that the jar does not change during an MR task.

 - An independent DFS store to upload jars to (Distributed File Cache?) that 
does not clean up between jobs.
    This might need user-level configuration to tell the jobclient to upload 
files to the DFSCache instead of the DFS. 
https://issues.apache.org/jira/browse/HADOOP-288 facilitates this. Our local 
cache can be a client of the DFS Cache.

 - A standard cache mechanism that checks for changes in the local store and 
fetches the file from the DFS if the local copy is found dirty.
    This does away with versioning. The DFSCache supports an MD5 checksum 
check; we can use that.
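
To make the dirty check concrete, here is a rough sketch of what the local 
cache could look like. It is only an illustration under my own assumptions: 
LocalJarCache, md5Hex and localize are hypothetical names, not existing 
Hadoop classes, a plain file path stands in for the DFS store, and the 
expected MD5 is assumed to come from the DFSCache.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical node-local jar cache; not part of any Hadoop API. */
final class LocalJarCache {
    private final Path cacheDir;   // local store on the task node
    private int copies = 0;        // how many times a jar was actually fetched

    LocalJarCache(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    /** Returns the lowercase hex MD5 of a file's contents. */
    static String md5Hex(Path file) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(Files.readAllBytes(file));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns the local copy of the jar. The jar is copied from the source
     * (standing in for the DFS store) only if the local copy is missing or
     * its MD5 differs from the expected one, i.e. the copy is dirty.
     */
    Path localize(Path sourceJar, String expectedMd5) {
        Path local = cacheDir.resolve(sourceJar.getFileName());
        boolean dirty = !Files.exists(local) || !md5Hex(local).equals(expectedMd5);
        if (dirty) {
            try {
                Files.createDirectories(cacheDir);
                Files.copy(sourceJar, local, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            copies++;
        }
        return local;
    }

    int copies() {
        return copies;
    }
}
```

With something like this, resubmitting a job with an unchanged jar costs one 
local checksum per node instead of a full copy, and a changed jar is picked 
up without any explicit version numbers.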

Anything else? Suggestions? Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.