[
https://issues.apache.org/jira/browse/MAPREDUCE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685822#comment-13685822
]
Bikas Saha commented on MAPREDUCE-5278:
---------------------------------------
We should add a config name and default value instead of directly referencing
the string.
{code}
+ String [] accessibleSchemes = job.getStrings(
+ "mapreduce.client.accessible.remote.schemes");
{code}
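A minimal sketch of what this could look like. The constant names and the default value `"hdfs"` are hypothetical (the patch states neither), and `java.util.Properties` stands in for the job configuration so the example runs without Hadoop on the classpath:

```java
import java.util.Arrays;
import java.util.Properties;

class AccessibleSchemesConfig {
    // Key string taken from the patch under review.
    static final String ACCESSIBLE_REMOTE_SCHEMES_KEY =
        "mapreduce.client.accessible.remote.schemes";

    // Hypothetical default: the patch does not define one.
    static final String DEFAULT_ACCESSIBLE_REMOTE_SCHEMES = "hdfs";

    // Reads the comma-separated scheme list, falling back to the default
    // when the key is not set.
    static String[] getAccessibleSchemes(Properties conf) {
        String raw = conf.getProperty(
            ACCESSIBLE_REMOTE_SCHEMES_KEY, DEFAULT_ACCESSIBLE_REMOTE_SCHEMES);
        return Arrays.stream(raw.split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .toArray(String[]::new);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Key unset: the default applies.
        System.out.println(Arrays.toString(getAccessibleSchemes(conf)));
        // Key set explicitly: the configured list wins.
        conf.setProperty(ACCESSIBLE_REMOTE_SCHEMES_KEY, "hdfs, asv");
        System.out.println(Arrays.toString(getAccessibleSchemes(conf)));
    }
}
```

In the real patch, the `getStrings(name, defaultValue...)` overload on Hadoop's `Configuration` would probably serve the same purpose, with both constants declared next to the other job configuration keys.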
Does the code marked below still copy files to the default file system (fs)
when newPath points to a different file system?
{code}
Path newPath = copyRemoteFiles(fs, libjarsDir, tmp, job, replication);
DistributedCache.addArchiveToClassPath
- (new Path(newPath.toUri().getPath()), job, fs);
+ (new Path(newPath.toUri().getPath()), job,
newPath.getFileSystem(job));
{code}
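To illustrate why the file system argument matters here: `newPath.toUri().getPath()` drops the scheme and authority, so whichever file system later qualifies the bare path decides where it points. The sketch below uses only `java.net.URI`. The `qualify` helper, the host names, and the bucket name are hypothetical, and it assumes the file system argument is what re-qualifies the bare path:

```java
import java.net.URI;

class PathQualification {
    // Re-attaches the scheme and authority of fsUri to a bare path, mimicking
    // how a file system would qualify a path that carries none of its own.
    static URI qualify(String barePath, URI fsUri) {
        return URI.create(
            fsUri.getScheme() + "://" + fsUri.getAuthority() + barePath);
    }

    public static void main(String[] args) {
        // Hypothetical setup: default FS is S3, staging dir lives on HDFS.
        URI defaultFs = URI.create("s3://bucket/");
        URI newPath = URI.create("hdfs://namenode:8020/staging/libjars/a.jar");

        // Equivalent of newPath.toUri().getPath(): scheme and authority are lost.
        String bare = newPath.getPath();

        // Qualified against the default FS, the path points at the wrong store.
        System.out.println(qualify(bare, defaultFs));
        // Qualified against the path's own FS, it round-trips.
        System.out.println(qualify(bare, newPath));
    }
}
```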
> Distributed cache is broken when JT staging dir is not on the default FS
> ------------------------------------------------------------------------
>
> Key: MAPREDUCE-5278
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5278
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: distributed-cache
> Affects Versions: 1-win
> Environment: Windows
> Reporter: Xi Fang
> Assignee: Xi Fang
> Fix For: 1-win
>
> Attachments: MAPREDUCE-5278.2.patch, MAPREDUCE-5278.patch
>
>
> Today, the JobTracker staging dir ("mapreduce.jobtracker.staging.root.dir") is
> set to point to HDFS, even though other file systems (e.g. Amazon S3 file
> system and Windows ASV file system) are the default file systems.
> For ASV, this configuration was chosen for a few reasons:
> 1. It prevents leaking the storage account credentials to the user's storage
> account;
> 2. It uses HDFS for the transient job files, which is good for two reasons: a)
> it does not flood the user's storage account with irrelevant data/files, and
> b) it leverages HDFS locality for small files.
> However, this approach conflicts with how distributed cache caching works,
> completely negating the feature's functionality.
> When files are added to the distributed cache (through the files/archives/libjars
> hadoop generic options), they are copied to the job tracker staging dir only
> if they reside on a file system different from the jobtracker's. Later on,
> this path is used as a "key" to cache the files locally on the tasktracker's
> machine, and avoid localization (download/unzip) of the distributed cache
> files if they are already localized.
> In this configuration the caching is completely disabled and we always end up
> copying dist cache files to the job tracker's staging dir first and
> localizing them on the task tracker machine second.
> This is especially bad for Oozie scenarios, as Oozie uses the dist cache to
> populate Hive/Pig jars throughout the cluster.
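
The caching behaviour described in the quoted issue can be illustrated with a small simulation. This is a sketch under assumptions, not Hadoop's implementation: the class and method names, the `/staging/<jobId>/libjars/` layout, and the host and bucket names are all hypothetical, and a `HashSet` stands in for the tasktracker's local cache.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

class DistCacheKeySketch {
    private static boolean eq(String x, String y) {
        return x == null ? y == null : x.equalsIgnoreCase(y);
    }

    // Two URIs are on the same file system if scheme and authority match.
    static boolean sameFileSystem(URI a, URI b) {
        return eq(a.getScheme(), b.getScheme())
            && eq(a.getAuthority(), b.getAuthority());
    }

    // Path used as the cache key for one job submission: the file itself if it
    // is already on the jobtracker's file system, else a per-job staging copy.
    static URI cacheKey(URI file, URI jobTrackerFs, String jobId) {
        if (sameFileSystem(file, jobTrackerFs)) {
            return file;
        }
        String path = file.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        // Assumed layout: each job gets its own staging directory.
        return URI.create(jobTrackerFs.getScheme() + "://"
            + jobTrackerFs.getAuthority() + "/staging/" + jobId + "/libjars/" + name);
    }

    // Counts how often the same file is localized across several jobs.
    static int countLocalizations(URI file, URI jobTrackerFs, int jobs) {
        Set<URI> localized = new HashSet<>();
        int downloads = 0;
        for (int i = 0; i < jobs; i++) {
            if (localized.add(cacheKey(file, jobTrackerFs, "job_" + i))) {
                downloads++; // cache miss: a new key means a new download
            }
        }
        return downloads;
    }

    public static void main(String[] args) {
        URI stagingFs = URI.create("hdfs://namenode:8020/");
        URI onDefaultFs = URI.create("s3://bucket/lib/hive.jar");
        URI onStagingFs = URI.create("hdfs://namenode:8020/lib/hive.jar");
        System.out.println(countLocalizations(onDefaultFs, stagingFs, 3));
        System.out.println(countLocalizations(onStagingFs, stagingFs, 3));
    }
}
```

Under these assumptions, a jar on the default file system gets a fresh key for every job and is localized each time, while a jar already on the staging file system keeps a stable key and is localized once.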