[ https://issues.apache.org/jira/browse/HADOOP-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475343 ]

Gautam Kowshik commented on HADOOP-1032:
----------------------------------------

Following are some things we need to address:
 - The distributed cache expects the user to upload the jars/files to HDFS. We 
can do one of the following:

  1. Keep this as is: the user uploads whatever files (multiple jars?) at once 
into a predefined folder and sets that path as the "mapred.cache.archives" 
property in the job conf.

  2. Make the upload transparent to the user by doing it in the jobclient. 
Currently the jobclient uploads the jar to a mirrored HDFS location given by 
the "mapred.jar" path; we can change this (behind a flag) to upload it to a 
relative path under the cache dir so it is not cleaned up after the job 
finishes, something like hdfs://CACHE_DIR/JAR_PATH/job.jar
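As a rough illustration of the two options, the sketch below uses java.util.Properties as a stand-in for JobConf and builds a hypothetical cache path; the hostnames, ports, and relative paths are made up for illustration, not actual defaults:

```java
import java.util.Properties;

public class JarCacheSketch {

    // Option 1: the user pre-uploads jars to a predefined HDFS folder and
    // points the job at them via "mapred.cache.archives" (comma-separated
    // for multiple archives). Properties stands in for JobConf here.
    static Properties option1JobConf() {
        Properties conf = new Properties();
        conf.setProperty("mapred.cache.archives",
                "hdfs://namenode:9000/user/cache/lib-a.jar,"
              + "hdfs://namenode:9000/user/cache/lib-b.jar");
        return conf;
    }

    // Option 2: the jobclient uploads the job jar under the cache dir at a
    // stable relative path (hdfs://CACHE_DIR/JAR_PATH/job.jar) instead of
    // the per-job mirrored location, so job cleanup leaves it alone.
    static String option2CachedJarPath(String cacheDir, String jarRelPath) {
        return cacheDir + "/" + jarRelPath + "/job.jar";
    }

    public static void main(String[] args) {
        System.out.println(option1JobConf().getProperty("mapred.cache.archives"));
        System.out.println(option2CachedJarPath("hdfs://namenode:9000/cache", "my-app"));
    }
}
```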

Pros and cons of 1:
Option 1 is what the caching mechanism already expects for any out-of-band 
files, so it is more generic than 2 and needs fewer changes. But it is a more 
rigid approach, and since this is a special scenario we could offer more 
functionality. Also, if the jars are not available from the start and there 
are multiple MR jobs, we'd have to do the copy after every MR job.

Pros and cons of 2:
Option 2 offers more flexibility: it can be made backward compatible with a 
flag and supports adding jar files to the cache whenever needed. However, 
multiple jars cannot be cached in one shot, and it needs changes in more 
parts of the platform flow at the jobclient level.

Both cases need changes at the task-node level, where the task looks for the 
jar in the specified cache path and includes it in the classpath before 
execution.
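The task-node change could look roughly like the following, assuming a hypothetical helper that prepends the locally cached job jar to the child task's classpath (all names and paths here are illustrative, not existing Hadoop code):

```java
import java.io.File;
import java.util.List;

public class TaskClasspathSketch {

    // Hypothetical helper: put the locally cached job jar first on the
    // classpath handed to the child task JVM, ahead of existing entries.
    static String buildClasspath(String cachedJar, List<String> existing) {
        StringBuilder cp = new StringBuilder(cachedJar);
        for (String entry : existing) {
            cp.append(File.pathSeparator).append(entry);
        }
        return cp.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildClasspath("/local/cache/jobs/my-app/job.jar",
                List.of("/hadoop/lib/hadoop-core.jar")));
    }
}
```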

thoughts?

> Support for caching Job JARs 
> -----------------------------
>
>                 Key: HADOOP-1032
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1032
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.11.2
>            Reporter: Gautam Kowshik
>            Priority: Minor
>
> Often jobs need to be rerun a number of times.. like a job that reads from 
> crawled data time and again.. so having to upload job jars to every node is 
> cumbersome. We need a caching mechanism to boost performance. Here are the 
> features for job-specific caching of jars/conf files: 
>  - Ability to resubmit jobs with jars without having to propagate the same 
> jar to all nodes.
>     The idea is to keep a store (path mentioned by the user in job.xml?) 
> local to the task node so as to speed up task initiation on tasktrackers. 
> Assumes that the jar does not change during an MR task.
> - An independent DFS store to upload jars to (Distributed File Cache?) that 
> does not clean up between jobs.
>     This might need user-level configuration to indicate to the jobclient to 
> upload files to the DFSCache instead of the DFS. 
> https://issues.apache.org/jira/browse/HADOOP-288 facilitates this. Our local 
> cache can be a client to the DFS Cache.
> - A standard cache mechanism that checks for changes in the local store and 
> picks from DFS if found dirty.
>    This does away with versioning. The DFSCache supports an md5 checksum 
> check; we can use that.
> Anything else? Suggestions? Thoughts?
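The md5 dirty check mentioned in the last bullet can be sketched with the standard java.security.MessageDigest API; the byte arrays below stand in for the contents of the local and DFS copies of the jar (a sketch only, not the actual DFSCache code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5DirtyCheck {

    // Compare MD5 digests of the local copy and the DFS copy; a mismatch
    // means the local store is dirty and the jar should be re-fetched.
    static boolean isDirty(byte[] localCopy, byte[] dfsCopy) {
        try {
            byte[] localMd5 = MessageDigest.getInstance("MD5").digest(localCopy);
            byte[] dfsMd5 = MessageDigest.getInstance("MD5").digest(dfsCopy);
            return !MessageDigest.isEqual(localMd5, dfsMd5);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is guaranteed to be present in every JRE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isDirty("jar-v1".getBytes(), "jar-v1".getBytes())); // false
        System.out.println(isDirty("jar-v1".getBytes(), "jar-v2".getBytes())); // true
    }
}
```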

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.