[ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
-----------------------------------
    Description: 
Hello,

I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start a Spark 
session (Zeppelin, shell, or spark-submit), my real username is impersonated 
successfully. This lets YARN pick the right queue based on the username, and 
HDFS applies the correct permissions. All of this works without any problem, 
which means the cluster has been set up and configured for user impersonation.

Example (running Spark as user panahi with YARN as the master):
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use *Spark RDD pipe()*, the external command is executed as the 
`*yarn*` user. This makes it impossible to use an external application (for 
example, one written in C/C++) that needs read/write access to HDFS, because 
the `*yarn*` user does not have permissions on the user's directory. Executing 
every external application as `*yarn*` also raises other security and resource 
management issues.

*How to reproduce this issue:*
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()

// Result:
// test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
// piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
// c: Array[String] = Array(yarn)
{code}
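
To narrow down where the identity changes, it may help to compare the OS account that owns the executor JVM with the identity Hadoop uses inside that JVM. The following is only a diagnostic sketch, assuming it is pasted into spark-shell (so `sc` already exists) and that the Hadoop classes are on the classpath, as they are in a YARN deployment. The variable names `identities` and `pipedUser` are illustrative and not part of the original report.
{code:scala}
// Paste into spark-shell; `sc` is the SparkContext the shell provides.
import org.apache.hadoop.security.UserGroupInformation

val identities = sc.parallelize(Seq("probe"), 1).mapPartitions { _ =>
  // OS account that owns the executor JVM; a process forked by pipe() inherits it.
  val osUser = System.getProperty("user.name")
  // Identity that Hadoop clients inside the executor JVM act as.
  val hadoopUser = UserGroupInformation.getCurrentUser.getShortUserName
  Iterator(s"osUser=$osUser hadoopUser=$hadoopUser")
}.collect()

// OS account as seen by an external process started through pipe().
val pipedUser = sc.parallelize(Seq("probe"), 1).pipe(Seq("whoami")).collect()

println(s"driver sparkUser=${sc.sparkUser}")
identities.foreach(println)
pipedUser.foreach(u => println(s"pipe whoami=$u"))
{code}
If `osUser` and the `whoami` output agree with each other but differ from `hadoopUser`, the impersonation is happening at the Hadoop level only, and the container process itself runs under a different OS account.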
 

I believe that since Spark is what launches this external process inside the 
YARN cluster, Spark needs to respect the actual/current username. Or maybe 
there is another configuration for impersonation between Spark and YARN in this 
situation, but I haven't found any.
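
One possible stopgap, sketched here under assumptions rather than offered as a fix: `pipe()` accepts an environment map, so the submitting user's name can be handed to the external process. Hadoop clients honor the `HADOOP_USER_NAME` environment variable only when the cluster uses Hadoop simple authentication (not Kerberos), which is an assumption about this cluster. This does not change the OS account of the external process, so local file permissions are unaffected. `printenv` stands in for the real external application.
{code:scala}
// Paste into spark-shell; `sc` is the SparkContext the shell provides.
// Assumption: Hadoop simple authentication, where HADOOP_USER_NAME is honored.
val env = Map("HADOOP_USER_NAME" -> sc.sparkUser)

// "printenv" is a placeholder for the external app; it echoes the variable
// so we can confirm the value reaches the forked process.
val seen = sc.parallelize(Seq("probe"), 1)
  .pipe(Seq("printenv", "HADOOP_USER_NAME"), env)
  .collect()

seen.foreach(println)
{code}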

 

Many thanks.


> Spark Pipe() executes the external app by yarn user not the real user
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26101
>                 URL: https://issues.apache.org/jira/browse/SPARK-26101
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.3.0
>            Reporter: Maziyar PANAHI
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
