Maziyar PANAHI created SPARK-26101:
--------------------------------------
Summary: Spark Pipe() executes the external app by yarn user not
the real user
Key: SPARK-26101
URL: https://issues.apache.org/jira/browse/SPARK-26101
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 2.3.0
Reporter: Maziyar PANAHI
Hello,
I am using `Spark 2.3.0.cloudera3` on a Cloudera cluster. When I start my Spark
session (Zeppelin, shell, or spark-submit), my real username is impersonated
successfully: YARN places the application in the right queue based on that
username, and HDFS applies the right permissions.
Example (running Spark as user `panahi`):
```
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups
with view permissions: Set();
users with modify permissions: Set(*panahi*); groups with modify permissions:
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: *panahi*
```
However, when I use Spark RDD `pipe()`, the external command is executed as the
`yarn` user. This makes it impossible to use a C/C++ application that needs
read/write access to HDFS, because the user `yarn` has no permissions on the
user's directory.
How to reproduce this issue:
```scala
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
```
*result:*
```
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
c: Array[String] = Array(*yarn*)
```
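For what it's worth, this looks consistent with how process spawning works in general: the piped command inherits the OS user of the executor JVM, and on this cluster the NodeManager launches containers as `yarn`. A minimal sketch outside Spark (plain Scala, no Spark assumed) showing that a child process reports the same OS user as the JVM that forks it:

```scala
import scala.sys.process._

// A forked child process always runs as the same OS user as the JVM
// that spawns it; inside a YARN container that is the container's
// launch user (here, `yarn`), not the submitting user.
val childUser  = Seq("whoami").!!.trim
val parentUser = System.getProperty("user.name")
println(s"parent=$parentUser child=$childUser")
```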
I believe that since Spark is the actor that invokes this execution inside the
YARN cluster, Spark should respect the actual/current username. Or perhaps
there is a separate impersonation config between Spark and YARN for this
situation, but I haven't found one.
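One partial workaround I can imagine (untested here, and only for clusters using simple authentication rather than Kerberos) is the `RDD.pipe` overload that takes an environment map: the Hadoop client in the external app resolves its user from `HADOOP_USER_NAME` when simple auth is in use. A sketch, where the commented Spark call assumes the `test` RDD from above and a hypothetical `./myapp` binary; the runnable part below it is plain Scala showing only that the extra env var reaches the child process:

```scala
import scala.sys.process._

// With Spark (sketch, not run here):
//   test.pipe(Seq("./myapp"), Map("HADOOP_USER_NAME" -> "panahi"))
// pipe() exports the env map into the piped process's environment,
// so an HDFS client inside `./myapp` would act as `panahi`.
// Plain-Scala illustration that an extra env var reaches the child:
val out = Process(Seq("sh", "-c", "echo $HADOOP_USER_NAME"),
                  None, "HADOOP_USER_NAME" -> "panahi").!!.trim
println(out)
```

Note this only changes the HDFS identity seen by the external app; the OS user reported by `whoami` would still be `yarn`.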
Many thanks.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)