[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953441#comment-15953441
 ] 

Jason Lowe commented on MAPREDUCE-6874:
---------------------------------------

This is a limitation of the distributed cache.  A full-depth traversal of a 
directory tree can be very expensive, and the API only supports one timestamp 
per distributed cache entry.  Not only is it expensive to stat the whole tree 
to see whether it has changed, it is also expensive to localize the files: 
there is RPC overhead for each file in the tree.

It is much more efficient, and safer, for an archive (e.g.: .tar.gz, .zip, 
etc.) to be used instead of a directory.  Then there's only one timestamp we 
need to check to know if anything in the "tree" has changed.  Arguably 
directory trees shouldn't be supported in the distributed cache at all, but I 
believe they were added way back when to support use cases where a chain of 
MapReduce jobs needed the output of a previous job (i.e.: a directory) to be 
used as a cache file for the next job (e.g.: a map-side join).
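As a sketch of the archive approach, the helper below packs a directory into a single .zip using java.util.zip (the name zipDirectory is hypothetical, not a Hadoop API); the point is that a consumer then has exactly one file to stat and one timestamp to compare, instead of one per entry in the tree.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class PackForCache {
    // Hypothetical helper: pack srcDir into one archive so a change anywhere
    // in the tree is detectable from the archive's single timestamp.
    static void zipDirectory(Path srcDir, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             Stream<Path> tree = Files.walk(srcDir)) {
            tree.filter(Files::isRegularFile).forEach(p -> {
                try {
                    // Store entries relative to the directory root.
                    zos.putNextEntry(new ZipEntry(srcDir.relativize(p).toString()));
                    Files.copy(p, zos);
                    zos.closeEntry();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("components");
        Files.write(dir.resolve("sub-component.sh"), "echo hi\n".getBytes());
        Path zip = Files.createTempFile("components", ".zip");
        zipDirectory(dir, zip);
        System.out.println(Files.isRegularFile(zip)); // prints "true"
    }
}
```

Such an archive can then be shipped as a cache archive (for example via Job.addCacheArchive() or the -archives generic option) and the framework unpacks it at localization time; the single-timestamp check is then sufficient.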

> Make DistributedCache check if the content of a directory has changed
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6874
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6874
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>
> DistributedCache does not check recursively whether the content of a 
> directory has changed when adding files to it with 
> {{DistributedCache.addCacheFile()}}. 
> h5. Background
> I have an Oozie workflow on HDFS:
> {code}
> example_workflow
> ├── job.properties
> ├── lib
> │   ├── components
> │   │   ├── sub-component.sh
> │   │   └── subsub
> │   │       └── subsub.sh
> │   ├── main.sh
> │   └── sub.sh
> └── workflow.xml
> {code}
> I executed the workflow, then made some changes to {{subsub.sh}} and 
> replaced the file on HDFS. When I re-ran the workflow, DistributedCache did 
> not notice the change because the timestamp on the {{components}} directory 
> had not changed. As a result, the old script was materialized.
> This behaviour might be related to [determineTimestamps() 
> |https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/filecache/ClientDistributedCacheManager.java#L84].
> In order to use the new script during workflow execution, I had to update the 
> whole {{components}} directory.
> h6. Some more info:
> In Oozie, [DistributedCache.addCacheFile() 
> |https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java#L625]
>  is used to add files to the distributed cache.
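The quoted scenario can be reproduced with a hedged local-filesystem sketch (hypothetical names; HDFS directory timestamps behave analogously for nested changes): replacing a file two levels down moves the mtime of its immediate parent, but not of the top-level directory that the single-timestamp check looks at.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class DirMtimeDemo {
    public static void main(String[] args) throws Exception {
        // Mirror the tree from the workflow above on the local filesystem.
        Path lib = Files.createTempDirectory("lib");
        Path subsub = Files.createDirectories(lib.resolve("components/subsub"));
        Path script = Files.write(subsub.resolve("subsub.sh"), "echo v1\n".getBytes());

        FileTime topBefore = Files.getLastModifiedTime(lib.resolve("components"));
        FileTime subBefore = Files.getLastModifiedTime(subsub);

        Thread.sleep(1100); // step past filesystem timestamp resolution
        // "Replace" the script, as a delete-and-recreate (like an HDFS overwrite).
        Files.delete(script);
        Files.write(script, "echo v2\n".getBytes());

        FileTime topAfter = Files.getLastModifiedTime(lib.resolve("components"));
        FileTime subAfter = Files.getLastModifiedTime(subsub);

        System.out.println(topBefore.equals(topAfter)); // prints "true": no change seen
        System.out.println(subBefore.equals(subAfter)); // prints "false": change is one level down
    }
}
```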


