warrenzhu25 commented on a change in pull request #30282:
URL: https://github.com/apache/spark/pull/30282#discussion_r519459534
##########
File path:
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
##########
@@ -130,6 +130,13 @@ package object config extends Logging {
.stringConf
.createOptional
+ private[spark] val SPARK_PYSPARK_ARCHIVE = ConfigBuilder("spark.yarn.pyspark.archives")
+ .doc("Location of pyspark.zip and py4j.zip.")
Review comment:
I tried using `--py-files`, but the resulting PYTHONPATH and resources look like this:
```
PYTHONPATH ->
{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.7-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip
resources:
__spark_conf__ -> resource { scheme: "hdfs" host: "MTPrime-CO4-fed"
port: -1 file:
"/user/zhonzh/.sparkStaging/application_1604622164128_7216/__spark_conf__.zip"
} size: 536359 timestamp: 1604858318432 type: ARCHIVE visibility: PRIVATE
pyspark.zip -> resource { scheme: "hdfs" host: "MTPrime-CO4-fed" port:
-1 file:
"/user/zhonzh/.sparkStaging/application_1604622164128_7216/pyspark.zip" } size:
595809 timestamp: 1604858311600 type: FILE visibility: PUBLIC
py4j-0.10.9-src.zip -> resource { scheme: "hdfs" host: "MTPrime-CO4-fed"
port: -1 file:
"/user/zhonzh/.sparkStaging/application_1604622164128_7216/py4j-0.10.9-src.zip"
} size: 41587 timestamp: 1604858316398 type: FILE visibility: PUBLIC
__spark_libs__ -> resource { scheme: "hdfs" host: "MTPrime-CO4-fed"
port: -1 file:
"/user/zhonzh/.sparkStaging/application_1604622164128_7216/spark-3.0.1-mt-jars.zip"
} size: 197891674 timestamp: 1604858291631 type: ARCHIVE visibility: PUBLIC
py4j-0.10.7-src.zip -> resource { scheme: "hdfs" host: "MTPrime-CO4-fed"
port: -1 file:
"/user/zhonzh/.sparkStaging/application_1604622164128_7216/py4j-0.10.7-src.zip"
} size: 42437 timestamp: 1604858314170 type: FILE visibility: PUBLIC
```
I used `--py-files
"hdfs://MTPrime-CO4-0/user/zhonzh/pyspark.zip,hdfs://MTPrime-CO4-0/user/zhonzh/py4j-0.10.9-src.zip"`.
These archives come from Spark 3, while the local Python libraries are from Spark 2.4.
There are two issues here:
1. The local pyspark.zip is added to PYTHONPATH first, so it takes precedence
and the archives passed via `--py-files` have no effect.
2. If I give my archive the same name, pyspark.zip, its upload is skipped
because both resources have the same name.
How would you suggest handling this?
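
To make the two issues concrete, here is a minimal sketch in plain Python with no Spark involved. The package name `demo_pkg`, the archive names, the URIs, and the helper functions are all hypothetical. The first part shows that when two zips on the search path contain the same package, Python imports from whichever zip comes first. The second part is only a toy model of the skip-by-name behavior described in issue 2, not Spark's actual code.

```python
import importlib
import os
import posixpath
import sys
import tempfile
import zipfile


def make_zip(directory, zip_name, version):
    """Create a zip holding a tiny package `demo_pkg` that reports `version`."""
    path = os.path.join(directory, zip_name)
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("demo_pkg/__init__.py", f"VERSION = {version!r}\n")
    return path


def resolve_version(search_path):
    """Prepend `search_path` to sys.path, import demo_pkg, return its VERSION."""
    saved_path = list(sys.path)
    sys.modules.pop("demo_pkg", None)
    try:
        sys.path[:0] = search_path
        importlib.invalidate_caches()
        return importlib.import_module("demo_pkg").VERSION
    finally:
        # Restore the interpreter state so repeated calls start clean.
        sys.path[:] = saved_path
        sys.modules.pop("demo_pkg", None)


def distribute(uris):
    """Toy model of issue 2: keep the first URI per file name, skip later ones."""
    seen, kept = set(), []
    for uri in uris:
        name = posixpath.basename(uri)
        if name in seen:
            continue  # same name already registered, so this one is skipped
        seen.add(name)
        kept.append(uri)
    return kept


workdir = tempfile.mkdtemp()
local_zip = make_zip(workdir, "local.zip", "2.4")  # stands in for the local lib
user_zip = make_zip(workdir, "user.zip", "3.0")    # stands in for --py-files

# Issue 1: the entry that appears first on the search path wins.
local_first = resolve_version([local_zip, user_zip])
user_first = resolve_version([user_zip, local_zip])

# Issue 2: a second resource with the same file name is dropped (hypothetical URIs).
kept = distribute(["file:/local/pyspark.zip", "hdfs://cluster/user/pyspark.zip"])

print(local_first, user_first, kept)
```

Under this model, a fix would need either to place the user-supplied archives ahead of the local ones on PYTHONPATH or to avoid registering the local archives when replacements are provided.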