Kimahriman commented on code in PR #50006:
URL: https://github.com/apache/spark/pull/50006#discussion_r1962544598


##########
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala:
##########
@@ -313,4 +314,21 @@ object Connect {
       .internal()
       .booleanConf
       .createWithDefault(true)
+
+  val CONNECT_AUTHENTICATE_TOKEN =
+    buildStaticConf("spark.connect.authenticate.token")

Review Comment:
   Okay, I see what you mean. I was not expecting the conf vars set on the 
Python `SparkConf` to end up on the command line; I thought they were just 
passed in memory through py4j. Let me think about that.
   
   There are a lot of use cases I'm trying to encompass here though, which is 
why I went with the dual config/env var setup (sketched just below).
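   
   To make that concrete, here is a minimal sketch of the dual lookup I have 
in mind. Only the conf key `spark.connect.authenticate.token` comes from this 
PR; the object name, helper, and env var name are illustrative assumptions, 
not the PR's actual API:
   
   ```scala
   import org.apache.spark.SparkConf
   
   object TokenLookupSketch {
     // Prefer the explicitly set static conf, then fall back to an env var
     // so cluster submissions can keep the secret off the driver's command
     // line (where `ps` would expose it). The env var name is a placeholder.
     def resolveToken(conf: SparkConf, env: Map[String, String] = sys.env): Option[String] =
       conf.getOption("spark.connect.authenticate.token")
         .orElse(env.get("SPARK_CONNECT_AUTHENTICATE_TOKEN"))
   }
   ```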
   
   - `pyspark --remote local` / `spark-shell --remote local`: For the most 
part I assume this is for testing purposes. I'm not sure what the use case is 
where multiple users are on the same machine but each needs to prevent the 
others from connecting to their own session. This is the case where seeing 
the `ps` output could be considered a security hole. Having the 
authentication at least prevents users on other machines from remotely 
accessing this Connect server.
   - `pyspark --conf spark.api.mode=connect` / `spark-shell --conf 
spark.api.mode=connect`: I think this is effectively the same as the previous 
use case.
   - `spark-submit --deploy-mode client --conf spark.api.mode=connect`: Kinda 
similar to the previous two: multiple users on the same machine that jobs are 
submitted from, but you don't want them to access your own sessions. I guess 
if multiple users can remotely start sessions this way on a dedicated server, 
they could see the Spark driver in the `ps` output. I don't _think_ this 
method would show the token on the command line, but I would need to verify.
   - `spark-submit --deploy-mode cluster --conf spark.api.mode=connect`: This 
is the case I am most worried about from a security perspective. You are 
launching a driver in a shared compute cluster, so anyone else on that 
cluster would be able to access your Spark Connect server without any 
authentication (the reason I brought up the security issue in the beginning). 
I also don't think this would show the token on the command line, but I would 
need to verify (see the quick check after this list).
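   
   For the "would need to verify" items above, here is a quick Linux-only 
check one could paste into `spark-shell`: it reads this JVM's own command 
line, which is exactly what another local user would see via `ps`. The 
`dummy-token-value` string is a placeholder, not a real token:
   
   ```scala
   import java.nio.file.{Files, Paths}
   
   // Linux-only: /proc/self/cmdline holds this process's full command line,
   // NUL-separated, the same thing `ps` reports to every local user.
   val cmdline = new String(Files.readAllBytes(Paths.get("/proc/self/cmdline")))
     .replace('\u0000', ' ')
   
   // If this prints true, the token was passed as a command-line argument
   // and is readable by anyone on the machine.
   println(cmdline.contains("dummy-token-value"))
   ```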


