[
https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314451#comment-14314451
]
Apache Spark commented on SPARK-5655:
-------------------------------------
User 'growse' has created a pull request for this issue:
https://github.com/apache/spark/pull/4509
> YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster
> configured in secure mode
> -----------------------------------------------------------------------------------------------------
>
> Key: SPARK-5655
> URL: https://issues.apache.org/jira/browse/SPARK-5655
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.2.0
> Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2
> Reporter: Andrew Rowson
> Labels: hadoop
>
> When running a Spark job on a YARN cluster which doesn't run containers under
> the same user as the nodemanager, and also when using the YARN auxiliary
> shuffle service, jobs fail with something similar to:
> {code:java}
> java.io.FileNotFoundException:
> /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
> (Permission denied)
> {code}
> The root cause is here:
> https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287
> Spark will attempt to chmod 700 any application directories it creates during
> the job, which includes files created in the nodemanager's usercache
> directory. The owner of these files is the container UID: on a secure cluster
> this is the name of the user submitting the job, and on a nonsecure cluster
> with yarn.nodemanager.container-executor.class configured it is the
> value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.
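For reference, the container-executor settings mentioned above live in yarn-site.xml. A sketch of the nonsecure-mode configuration (the property names are real; the local-user value shown is the common default, 'nobody', which matches the ownership in the listing below):

```xml
<!-- yarn-site.xml: run containers via the Linux container executor even on a
     nonsecure cluster, mapping all jobs to a single local user -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>nobody</value>
</property>
```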
> The problem is that the auxiliary shuffle manager runs as part of the
> nodemanager, which is typically running as the user 'yarn', and so it cannot
> access files that are readable only by their owner.
> YARN already attempts to secure files created under appcache while keeping
> them readable by the nodemanager: it sets the group of the appcache directory
> to 'yarn' and also sets the setgid flag, so files and directories created
> underneath should also have the 'yarn' group. Normally this means the
> nodemanager can read these files, but Spark's chmod 700 wipes this out.
> I'm not sure what the right approach is here. Commenting out the chmod 700
> functionality makes this work on YARN, and still leaves the application files
> readable only by the owner and the group:
> {code}
> /data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
> # ls -lah
> total 206M
> drwxr-s--- 2 nobody yarn 4.0K Feb 6 18:30 .
> drwxr-s--- 12 nobody yarn 4.0K Feb 6 18:30 ..
> -rw-r----- 1 nobody yarn 206M Feb 6 18:30 shuffle_0_0_0.data
> {code}
> But this may not be the right approach off YARN. Perhaps an additional check
> is needed to determine whether the chmod 700 step is actually necessary
> (i.e. when not running on YARN). Sadly, I don't have a non-YARN environment
> to test with, otherwise I'd be able to suggest a patch.
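One possible direction, purely as a sketch of an untested idea rather than a patch: restrict the Spark directories to owner plus group instead of owner-only, which preserves the setgid-inherited 'yarn' group access while still excluding other users:

```shell
#!/bin/sh
# Sketch of the owner+group alternative (hypothetical paths, untested idea):
# mode 750/640 matches the listing above and keeps the 'yarn' group readable.
set -e
d=$(mktemp -d)
chmod 2770 "$d"                  # stand-in for the setgid appcache directory
mkdir "$d/spark-local"
chmod 750 "$d/spark-local"       # rwxr-x---: owner full, group enter/list
touch "$d/spark-local/shuffle_0_0_0.data"
chmod 640 "$d/spark-local/shuffle_0_0_0.data"   # rw-r-----: group readable
stat -c '%a' "$d/spark-local/shuffle_0_0_0.data"   # prints 640
```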
> I believe there is a related issue in the MapReduce framework:
> https://issues.apache.org/jira/browse/MAPREDUCE-3728
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)