HyukjinKwon commented on code in PR #41942:
URL: https://github.com/apache/spark/pull/41942#discussion_r1260951116
##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1775,21 +1773,31 @@ class SparkContext(config: SparkConf) extends Logging {
}
val timestamp = if (addedOnSubmit) startTime else System.currentTimeMillis
+ // If the session ID was specified from SparkSession, it's from a Spark Connect client.
+ // Specify a dedicated directory for Spark Connect client.
+ // We're running Spark Connect as a service so regular PySpark path is not affected.
+ lazy val root = if (jobArtifactUUID != "default") {
+ val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)
Review Comment:
Yeah, it now needs to reuse `PythonWorkerFactory`, which assumes that there is
a UUID-named directory under `SparkFiles.getRootDirectory()` on both the
Driver and the Executor. We _could_ try to reuse the local artifact directory,
but I would prefer to keep a separate local copy for now, for better
maintainability and reusability.
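For illustration, a minimal sketch of the layout that assumption implies (the `resolveSessionRoot` helper is hypothetical; only `SparkFiles.getRootDirectory()` and the `"default"` sentinel come from the hunk above):
```scala
import java.io.File
import org.apache.spark.SparkFiles

// Hypothetical helper mirroring the assumption: a Spark Connect session keeps
// its artifacts in a UUID-named subdirectory of the SparkFiles root, on both
// the driver and the executors; the regular path keeps using the root itself.
def resolveSessionRoot(jobArtifactUUID: String): File =
  if (jobArtifactUUID != "default") {
    // e.g. <sparkFilesRoot>/<uuid>/
    new File(SparkFiles.getRootDirectory(), jobArtifactUUID)
  } else {
    new File(SparkFiles.getRootDirectory())
  }
```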
Otherwise, it would upload to the Spark file server twice (as we discussed
offline). I pushed new changes to avoid this. After this change, we no longer
upload twice, because:
1. The `spark://` URI is passed directly to `addFile` and `addJar`.
2. `addFile` and `addJar` do not attempt to upload such files again, but pass
the original URI through as-is (see the sketch below).
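A hedged sketch of that short-circuit (the `resolveForFileServer` name and the upload callback are made up for illustration; the actual change lives inside `addFile`/`addJar` in the PR):
```scala
import java.net.URI

// Sketch: a `spark://` URI already points at the Spark file server, so it is
// handed through unchanged instead of being uploaded a second time; anything
// else goes through the normal upload path.
def resolveForFileServer(path: String, upload: String => String): String =
  if (new URI(path).getScheme == "spark") path else upload(path)
```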
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]