[ https://issues.apache.org/jira/browse/SPARK-46860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17938513#comment-17938513 ]
Krzysztof Ruta edited comment on SPARK-46860 at 3/26/25 9:37 AM:
-----------------------------------------------------------------
I did some research and experiments. I identified two places where a URL
containing credentials is potentially logged - this applies particularly to
pt. 2 above. But as soon as I addressed these I found another... E.g. Spark
stores its jars location in the session properties (spark.jars) - what if
somebody decides to log the full Spark config for debugging purposes? Or what
if somebody logs the full spark-submit command (which includes the URL) even
before the Spark app is launched?
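Just to illustrate the spark.jars point, here is a minimal PySpark sketch
(the config-dumping loop is hypothetical debug code, not something Spark does
by itself) showing that a credentialed URL passed via --jars comes back from
the session config like any other property:

from pyspark.sql import SparkSession

# Assume the app was launched with something like:
#   spark-submit --jars https://username:password@domain.com/jars/runtime.jar app.py
spark = SparkSession.builder.getOrCreate()

# A well-meaning "dump the whole config" debug helper prints spark.jars too,
# credentials included - getAll() returns the property values verbatim.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)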
I don't think altering Spark's logging in order to keep secrets safe is the
way to go. It would give you a false sense of confidence that your password
cannot leak. You can never be sure that in some scenario (network problems,
wrong characters in the password, debug-level logging etc.) the URL would not
be logged anyway.
So in my opinion the key here is to secure your logging system independently
of Spark. Take Apache Airflow or GitLab CI/CD - either you are explicitly
given the option to mask your secrets or you must do it manually. Try to go
this way; in every scenario I can think of it is the safer approach.
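To sketch the masking idea outside of those tools - this is a generic Python
logging filter I wrote for illustration, not an Airflow, GitLab or Spark
API - you could redact credentialed URLs before any log line leaves the
process:

import logging
import re

# Matches the user:password@ part of a URL.
CREDENTIALS = re.compile(r"://[^/\s:@]+:[^/\s@]+@")

class MaskCredentialsFilter(logging.Filter):
    def filter(self, record):
        # Rewrite the message so credentials never reach the handler's output.
        record.msg = CREDENTIALS.sub("://***:***@", str(record.msg))
        return True

logger = logging.getLogger("submit")
handler = logging.StreamHandler()
handler.addFilter(MaskCredentialsFilter())
logger.addHandler(handler)

logger.warning("Fetching https://username:password@domain.com/jars/runtime.jar")
# prints: Fetching https://***:***@domain.com/jars/runtime.jar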
To test it, just use obviously incorrect credentials (like the ones you
mentioned above), or correct ones that you can quickly rotate, and search for
them in the logs. When masking works, you should never find them.
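For example (the log directory and the throwaway credentials below are just
placeholders), a quick scan after such a test run could look like this:

import pathlib

# The throwaway credentials used only for the test run.
SECRETS = ["testuser", "testpass123"]

# Wherever your driver/executor logs are collected - adjust the path.
for path in pathlib.Path("/var/log/spark").rglob("*"):
    if path.is_file():
        text = path.read_text(errors="ignore")
        for secret in SECRETS:
            if secret in text:
                print(f"LEAK: {secret!r} found in {path}")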
> Credentials with https url not working for --jars, --files, --archives &
> --py-files options on spark-submit command
> -------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-46860
> URL: https://issues.apache.org/jira/browse/SPARK-46860
> Project: Spark
> Issue Type: Task
> Components: k8s
> Affects Versions: 3.3.3, 3.5.0, 3.3.4
> Environment: Spark 3.3.3 deployed on K8s
> Reporter: Vikram Janarthanan
> Priority: Major
> Labels: pull-request-available
>
> We are trying to run a Spark application by pointing the dependent files as
> well as the main PySpark script at a secure webserver.
> We are looking for a solution to pass the dependencies as well as the
> PySpark script from the webserver.
> We have tried deploying the Spark application from the webserver to the k8s
> cluster without a username and password and it worked, but when we tried
> with username/password we get: Exception in thread "main"
> java.io.IOException: Server returned HTTP response code: 401 for URL:
> https://username:password@domain.com/application/pysparkjob.py
> *Working options on spark-submit:*
> spark-submit ......
> --repositories https://username:password@domain.com/repo1/repo \
> --jars https://domain.com/jars/runtime.jar \
> --files https://domain.com/files/query.sql \
> --py-files https://domain.com/pythonlib/pythonlib.zip \
> https://domain.com/app1/pysparkapp.py
> Note: only the --repositories option works with username and password
> *Spark-submit using https url with username/password not working:*
> spark-submit ......
> --jars https://username:password@domain.com/jars/runtime.jar \
> --files https://username:password@domain.com/files/query.sql \
> --py-files https://username:password@domain.com/pythonlib/pythonlib.zip \
> https://username:password@domain.com/app1/pysparkapp.py
>
> Error :
> 25/01/23 09:19:57 WARN NativeCodeLoader: Unable to load native-hadoop library
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.io.IOException: Server returned HTTP
> response code: 401 for URL:
> https://username:password@domain.com/repository/spark-artifacts/pysparkdemo/1.0/pysparkdemo-1.0.tgz
> at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000)
> at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
> at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:224)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:809)
> at org.apache.spark.util.DependencyUtils$.downloadFile(DependencyUtils.scala:264)
> at org.apache.spark.util.DependencyUtils$.$anonfun$downloadFileList$2(DependencyUtils.scala:233)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
> at scala.collection.TraversableLike.map(TraversableLike.scala:286)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>
>