[ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
-----------------------------------
    Summary: Spark Pipe() executes the external app by yarn username not the 
current username  (was: Spark Pipe() executes the external app by yarn user not 
the real user)

> Spark Pipe() executes the external app by yarn username not the current 
> username
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-26101
>                 URL: https://issues.apache.org/jira/browse/SPARK-26101
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.3.0
>            Reporter: Maziyar PANAHI
>            Priority: Major
>
> Hello,
> I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark 
> session (Zeppelin, shell, or spark-submit), my real username is impersonated 
> successfully. This lets YARN pick the right queue based on the username, and 
> HDFS applies the correct permissions. (All of this works without any problem, 
> meaning the cluster has been set up and configured for user impersonation.)
> Example (running Spark as user panahi with YARN as the master):
> {code:java}
>  
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set()
> ...
> 18/11/17 13:55:52 INFO yarn.Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: N/A
> ApplicationMaster RPC port: -1
> queue: root.multivac
> start time: 1542459353040
> final status: UNDEFINED
> tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
> user: panahi
> {code}
>  
> However, when I use *Spark RDD Pipe()*, the external command is executed as 
> the `*yarn*` user. This makes it impossible to use an external application 
> (for example, one written in C/C++) that needs read/write access to HDFS, 
> because the `*yarn*` user does not have permissions on the user's directory. 
> (Running every external app as the yarn user also raises other security and 
> resource management issues.)
> *How to reproduce this issue:*
> {code:java}
> val test = sc.parallelize(Seq("test user")).repartition(1)
> val piped = test.pipe(Seq("whoami"))
> val c = piped.collect()
>
> // result:
> // test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
> // piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
> // c: Array[String] = Array(yarn)
> {code}
>  
> I believe that since Spark is the component that invokes this execution inside 
> the YARN cluster, Spark needs to respect the actual/current username. Or maybe 
> there is another config for impersonation between Spark and YARN in this 
> situation, but I haven't found any.
>  
> Many thanks.
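
To narrow down where the two usernames diverge, it may help to compare the operating-system user of the executor JVM with the Hadoop-level user that Spark runs tasks as. The following is a minimal diagnostic sketch, not part of the original report. It assumes a spark-shell session on YARN where `sc` is the predefined SparkContext and Hadoop's `UserGroupInformation` class is on the classpath. The variable names (`users`, `pipedUser`) and the `"probe"` element are placeholders chosen for illustration.

```scala
import org.apache.hadoop.security.UserGroupInformation

// Assumption: running inside spark-shell, so `sc` already exists.
// Collect, from one executor task, both notions of "user".
val users = sc.parallelize(Seq("probe"), 1).mapPartitions { _ =>
  // OS account that owns the executor JVM process
  val osUser = System.getProperty("user.name")
  // Hadoop-level user the task is running as (used for HDFS access checks)
  val hadoopUser = UserGroupInformation.getCurrentUser.getShortUserName
  Iterator((osUser, hadoopUser))
}.collect()

// Same check as in the report: ask an external process who it runs as.
val pipedUser = sc.parallelize(Seq("probe"), 1).pipe(Seq("whoami")).collect()

users.foreach { case (os, hadoop) =>
  println(s"executor OS user = $os, Hadoop user = $hadoop")
}
println(s"pipe(whoami) returned: ${pipedUser.mkString(",")}")
```

If the OS user matches what `pipe(whoami)` returns while the Hadoop user shows the submitting account, that would suggest the impersonation applies only at the Hadoop API level, and that a process started through `pipe()` simply inherits the OS account of the executor container.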


