Denes Bodo created OOZIE-3227:
---------------------------------
Summary: Eliminate duplicated dependencies from distributed cache
Key: OOZIE-3227
URL: https://issues.apache.org/jira/browse/OOZIE-3227
Project: Oozie
Issue Type: Sub-task
Components: core
Affects Versions: 5.0.0
Reporter: Denes Bodo
Assignee: Denes Bodo
With Hadoop 3, the list in *mapreduce.job.cache.files* must not contain
multiple dependencies with the same file name.
The issue occurs when the same file name exists in multiple sharelib folders
and/or in my application's lib folder. This can be avoided, but it is not
always easy.
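To check whether a given cache list is affected, the entries can be grouped by file name. The following is only a minimal sketch, not part of Oozie: the class name, the method name and the sample paths are made up for illustration, and it assumes the list is a plain comma-separated string whose file name is the last path segment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an Oozie or Hadoop class.
final class CacheNameCollisions {

    // Groups the comma-separated entries by their last path segment and
    // returns only the file names that occur more than once.
    static Map<String, List<String>> findCollisions(String cacheFiles) {
        Map<String, List<String>> byName = new LinkedHashMap<>();
        if (cacheFiles == null || cacheFiles.isEmpty()) {
            return byName;
        }
        for (String entry : cacheFiles.split(",")) {
            String name = entry.substring(entry.lastIndexOf('/') + 1);
            byName.computeIfAbsent(name, k -> new ArrayList<>()).add(entry);
        }
        // Drop every file name that is referenced by a single path only.
        byName.values().removeIf(paths -> paths.size() < 2);
        return byName;
    }

    public static void main(String[] args) {
        // Made-up paths: the same jar in two sharelib folders plus one unique jar.
        String cacheFiles = "hdfs://nn/share/lib/oozie/commons-io.jar,"
                + "hdfs://nn/share/lib/hive/commons-io.jar,"
                + "hdfs://nn/user/me/app/lib/app.jar";
        System.out.println(findCollisions(cacheFiles));
    }
}
```

Any file name reported by this check is one that would appear more than once in the cache list.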
I suggest removing the duplicates from this list.
A quick workaround in the source code of JavaActionExecutor looks like this:
{code}
removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.files");
removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.archives");
......
// Keeps only the first entry for each file name in the comma-separated
// list stored under the given key. Needs java.util.HashMap and java.util.Map.
private void removeDuplicatedDependencies(JobConf conf, String key) {
    final String dependencies = conf.get(key);
    if (dependencies == null || dependencies.isEmpty()) {
        return; // key not set or empty: nothing to deduplicate
    }
    final Map<String, String> nameToPath = new HashMap<>();
    final StringBuilder uniqList = new StringBuilder();
    for (String dependency : dependencies.split(",")) {
        // The file name is everything after the last '/'.
        final String dependencyName = dependency.substring(dependency.lastIndexOf('/') + 1);
        if (nameToPath.containsKey(dependencyName)) {
            LOG.warn(dependencyName + " [" + dependency + "] is already defined in " + key + ". Skipping...");
        } else {
            nameToPath.put(dependencyName, dependency);
            if (uniqList.length() > 0) {
                uniqList.append(",");
            }
            uniqList.append(dependency);
        }
    }
    conf.set(key, uniqList.toString());
}
{code}
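To try the deduplication logic without Hadoop on the classpath, the same idea can be expressed on a plain string. This is only a sketch under assumptions: the class and method names are hypothetical, and it keeps the first path seen for each file name, which may not be the version a given job actually needs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone helper, not an Oozie or Hadoop class.
final class CacheListDedup {

    // Returns the comma-separated list with only the first entry kept for
    // each file name (the last path segment); the original order is preserved.
    static String dedupeByFileName(String commaSeparated) {
        if (commaSeparated == null || commaSeparated.isEmpty()) {
            return commaSeparated;
        }
        Map<String, String> firstByName = new LinkedHashMap<>();
        for (String entry : commaSeparated.split(",")) {
            String name = entry.substring(entry.lastIndexOf('/') + 1);
            firstByName.putIfAbsent(name, entry); // later duplicates are ignored
        }
        return String.join(",", firstByName.values());
    }

    public static void main(String[] args) {
        // Made-up paths for illustration only.
        System.out.println(dedupeByFileName(
                "/share/lib/oozie/a.jar,/share/lib/hive/a.jar,/user/me/app/lib/b.jar"));
    }
}
```

Because the first occurrence wins, the order in which sharelib and application lib entries are added to the list decides which copy is kept.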
Another option is to stop using the deprecated
*org.apache.hadoop.filecache.DistributedCache*.
I am going to look deeper into how we should use the distributed cache;
all comments are welcome.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)