[ https://issues.apache.org/jira/browse/SPARK-46860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17938513#comment-17938513 ]
Krzysztof Ruta edited comment on SPARK-46860 at 3/26/25 9:37 AM:
-----------------------------------------------------------------
I did some research and experiments. I identified two places where a URL
containing credentials is potentially logged - this applies particularly to
pt. 2 above. But as soon as I addressed these I found another... E.g. Spark
stores its jars location in the session properties (spark.jars) - what if
somebody decides to log the full Spark config for debugging purposes? Or what
if somebody logs the full spark-submit command (which includes the URL) even
before the Spark app is launched?
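Just to illustrate the spark.jars point, here is a minimal PySpark sketch
(the config-dumping loop is hypothetical debug code, not something Spark does
by itself) showing that a credentialed URL passed via --jars comes back from
the session config like any other property:

from pyspark.sql import SparkSession

# Assume the app was launched with something like:
#   spark-submit --jars https://username:password@domain.com/jars/runtime.jar app.py
spark = SparkSession.builder.getOrCreate()

# A well-meaning "dump the whole config" debug helper prints spark.jars too,
# credentials included - getAll() returns the property values verbatim.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)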
I don't think altering Spark's logging in order to keep secrets safe is the
way to go. It would give you a false sense of confidence that your password
cannot leak. You can never be sure that in some scenario (network problems,
wrong characters in the password, debug-level logging etc.) the URL would not
be logged anyway.
So in my opinion the key here is to secure your logging system independently
of Spark. Take Apache Airflow or GitLab CI/CD - either you are explicitly
given the option to mask your secrets or you must do it manually. Try to go
this way; in every scenario I can think of it is the safer approach.
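To sketch the masking idea outside of those tools - this is a generic Python
logging filter I wrote for illustration, not an Airflow, GitLab or Spark
API - you could redact credentialed URLs before any log line leaves the
process:

import logging
import re

# Matches the user:password@ part of a URL.
CREDENTIALS = re.compile(r"://[^/\s:@]+:[^/\s@]+@")

class MaskCredentialsFilter(logging.Filter):
    def filter(self, record):
        # Rewrite the message so credentials never reach the handler's output.
        record.msg = CREDENTIALS.sub("://***:***@", str(record.msg))
        return True

logger = logging.getLogger("submit")
handler = logging.StreamHandler()
handler.addFilter(MaskCredentialsFilter())
logger.addHandler(handler)

logger.warning("Fetching https://username:password@domain.com/jars/runtime.jar")
# prints: Fetching https://***:***@domain.com/jars/runtime.jar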
To test it, just use obviously incorrect credentials (like the ones you
mentioned above), or correct ones that you can quickly rotate, and search for
them in the logs. When masking works, you should never find them.
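For example (the log directory and the throwaway credentials below are just
placeholders), a quick scan after such a test run could look like this:

import pathlib

# The throwaway credentials used only for the test run.
SECRETS = ["testuser", "testpass123"]

# Wherever your driver/executor logs are collected - adjust the path.
for path in pathlib.Path("/var/log/spark").rglob("*"):
    if path.is_file():
        text = path.read_text(errors="ignore")
        for secret in SECRETS:
            if secret in text:
                print(f"LEAK: {secret!r} found in {path}")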
> Credentials with https url not working for --jars, --files, --archives &
> --py-files options on spark-submit command
> -------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-46860
> URL: https://issues.apache.org/jira/browse/SPARK-46860
> Project: Spark
> Issue Type: Task
> Components: k8s
> Affects Versions: 3.3.3, 3.5.0, 3.3.4
> Environment: Spark 3.3.3 deployed on K8s
> Reporter: Vikram Janarthanan
> Priority: Major
> Labels: pull-request-available
>
> We are trying to run a Spark application by pointing the dependent files as
> well as the main PySpark script at a secure webserver.
> We are looking for a solution to pass the dependencies as well as the
> PySpark script from the webserver.
> We have tried deploying the Spark application from the webserver to the k8s
> cluster without a username and password and it worked, but when we tried
> with username/password we get: Exception in thread "main"
> java.io.IOException: Server returned HTTP response code: 401 for URL:
> https://username:password@domain.com/application/pysparkjob.py
> *Working options on spark-submit:*
> spark-submit ......
> --repositories https://username:password@domain.com/repo1/repo \
> --jars https://domain.com/jars/runtime.jar \
> --files https://domain.com/files/query.sql \
> --py-files https://domain.com/pythonlib/pythonlib.zip \
> https://domain.com/app1/pysparkapp.py
> Note: only the --repositories option works with username and password
> *Spark-submit using https url with username/password not working:*
> spark-submit ......
> --jars https://username:password@domain.com/jars/runtime.jar \
> --files https://username:password@domain.com/files/query.sql \
> --py-files https://username:password@domain.com/pythonlib/pythonlib.zip \
> https://username:password@domain.com/app1/pysparkapp.py
>
> Error :
> 25/01/23 09:19:57 WARN NativeCodeLoader: Unable to load native-hadoop library
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.io.IOException: Server returned HTTP
> response code: 401 for URL:
> https://username:password@domain.com/repository/spark-artifacts/pysparkdemo/1.0/pysparkdemo-1.0.tgz
> at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000)
> at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
> at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:224)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:809)
> at org.apache.spark.util.DependencyUtils$.downloadFile(DependencyUtils.scala:264)
> at org.apache.spark.util.DependencyUtils$.$anonfun$downloadFileList$2(DependencyUtils.scala:233)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
> at scala.collection.TraversableLike.map(TraversableLike.scala:286)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>
>