This is an automated email from the ASF dual-hosted git repository.
tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 37b7d32 [SPARK-30845] Do not upload local pyspark archives for spark-submit on Yarn
37b7d32 is described below
commit 37b7d32dbd3546c303d31305ed40c6435390bb4d
Author: Shanyu Zhao <[email protected]>
AuthorDate: Mon Jun 8 15:55:49 2020 -0500
[SPARK-30845] Do not upload local pyspark archives for spark-submit on Yarn
### What changes were proposed in this pull request?
Use spark-submit to submit a pyspark app on Yarn, and set this in spark-env.sh:
export PYSPARK_ARCHIVES_PATH=local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip
You can see that these local archives are still uploaded to the Yarn distributed cache:
yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip ->
hdfs://myhdfs/user/test1/.sparkStaging/application_1581024490249_0001/pyspark.zip
This PR fixes the issue by checking the files specified in PYSPARK_ARCHIVES_PATH: if an archive uses the local: scheme, it is not distributed to the Yarn dist cache.
### Why are the changes needed?
To let pyspark apps use local pyspark archives set in PYSPARK_ARCHIVES_PATH.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests and manual tests.
Closes #27598 from shanyu/shanyu-30845.
Authored-by: Shanyu Zhao <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
---
.../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index fc429d6..7b12119 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -635,7 +635,12 @@ private[spark] class Client(
distribute(args.primaryPyFile, appMasterOnly = true)
}
- pySparkArchives.foreach { f => distribute(f) }
+ pySparkArchives.foreach { f =>
+ val uri = Utils.resolveURI(f)
+ if (uri.getScheme != Utils.LOCAL_SCHEME) {
+ distribute(f)
+ }
+ }
// The python files list needs to be treated especially. All files that are not an
// archive need to be placed in a subdirectory that will be added to PYTHONPATH.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]