[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maziyar PANAHI updated SPARK-26101:
-----------------------------------
    Summary: Spark Pipe() executes the external app by yarn username not the current username  (was: Spark Pipe() executes the external app by yarn user not the real user)

> Spark Pipe() executes the external app by yarn username not the current username
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-26101
>                 URL: https://issues.apache.org/jira/browse/SPARK-26101
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.3.0
>            Reporter: Maziyar PANAHI
>            Priority: Major
>
> Hello,
> I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark
> session (Zeppelin, Shell, or spark-submit), my real username is impersonated
> successfully. That allows YARN to pick the right queue based on the username,
> and HDFS to apply the right permissions. (All of this works without any
> problem, meaning the cluster has been set up and configured for user
> impersonation.)
> Example (running Spark as user panahi with YARN as the master):
> {code:java}
>
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set()
> ...
> 18/11/17 13:55:52 INFO yarn.Client:
>      client token: N/A
>      diagnostics: N/A
>      ApplicationMaster host: N/A
>      ApplicationMaster RPC port: -1
>      queue: root.multivac
>      start time: 1542459353040
>      final status: UNDEFINED
>      tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
>      user: panahi
> {code}
>
> However, when I use *Spark RDD Pipe()*, the command is executed as the `*yarn*` user.
> This makes it impossible to use an external app, such as a C/C++ application,
> that needs read/write access to HDFS, because the user `*yarn*` does not have
> permissions on the user's directory. (Executing all external apps as the yarn
> user also raises other security and resource management issues.)
> *How to reproduce this issue:*
> {code:java}
> val test = sc.parallelize(Seq("test user")).repartition(1)
> val piped = test.pipe(Seq("whoami"))
> val c = piped.collect()
>
> result:
> test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
> piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
> c: Array[String] = Array(yarn)
> {code}
>
> I believe that, since Spark is what invokes this execution inside the YARN
> cluster, Spark needs to respect the actual/current username. Or maybe there
> is another config for impersonation between Spark and YARN in this situation,
> but I haven't found any.
>
> Many thanks.
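
One way to read the `Array(yarn)` result above: a command spawned from a JVM normally runs as the OS user that owns that JVM, whatever username the application layer reports. The sketch below is plain Scala without Spark and only illustrates that inheritance. It assumes that `pipe()` starts its command as a child process of the executor JVM and that `whoami` is available on the PATH. The object name `ChildProcessUser` and its methods are made up for this illustration.

```scala
import scala.sys.process._

// Hypothetical helper, not part of Spark: compares the OS user of the
// current JVM with the OS user seen by a child process it launches.
object ChildProcessUser {

  // OS account that owns this JVM process.
  def jvmUser: String = System.getProperty("user.name")

  // Runs `whoami` as a child process and returns its trimmed output.
  // Throws if the command is missing or exits with a non-zero status.
  def childUser: String = Seq("whoami").!!.trim

  def main(args: Array[String]): Unit = {
    println(s"JVM user:           $jvmUser")
    println(s"child process user: $childUser")
  }
}
```

If this were run on a worker node under the account that owns the YARN containers, both values would be expected to name that account rather than the user who submitted the job, which would match the behaviour reported above.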