GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/21426
[SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files correctly into
PythonRunner in submit with client mode in spark-submit
## What changes were proposed in this pull request?
In client mode, when the application is a Python file, a .py file passed via
`--py-files` cannot be imported before the context is initialized.
See below:
```
$ cat /home/spark/tmp.py
def testtest():
    return 1
```
This works:
```
$ cat app.py
import pyspark
pyspark.sql.SparkSession.builder.getOrCreate()
import tmp
print("************************%s" % tmp.testtest())
$ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py
...
************************1
```
but this doesn't:
```
$ cat app.py
import pyspark
import tmp
pyspark.sql.SparkSession.builder.getOrCreate()
print("************************%s" % tmp.testtest())
$ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py
Traceback (most recent call last):
  File "/home/spark/spark/app.py", line 2, in <module>
    import tmp
ImportError: No module named tmp
```
### How did it happen?
In client mode specifically, the paths are added into PythonRunner as they
are:
https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L430
https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L49-L88
The problem here is that a .py file shouldn't be added as is, since
`PYTHONPATH` expects a directory or an archive such as a zip or egg.
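The behavior can be reproduced outside Spark. The following is a minimal sketch, not part of this patch: it writes a throwaway module (the name `spark24384_demo_mod` and the helper `can_import` are made up for illustration) and checks whether it can be imported when the `sys.path` entry is the .py file itself versus its containing directory.

```python
import importlib
import os
import sys
import tempfile


def can_import(module_name, path_entry):
    """Return True if module_name imports with path_entry prepended to sys.path."""
    sys.modules.pop(module_name, None)
    importlib.invalidate_caches()
    sys.path.insert(0, path_entry)
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        # Leave the interpreter state as we found it.
        sys.path.remove(path_entry)
        sys.modules.pop(module_name, None)


# Hypothetical module used only for this demonstration.
workdir = tempfile.mkdtemp()
module_file = os.path.join(workdir, "spark24384_demo_mod.py")
with open(module_file, "w") as f:
    f.write("def testtest():\n    return 1\n")

# A path entry that points at the .py file itself is ignored by the importer.
file_entry_works = can_import("spark24384_demo_mod", module_file)
# A path entry that points at the containing directory works.
dir_entry_works = can_import("spark24384_demo_mod", workdir)
print(file_entry_works, dir_entry_works)
```

Only the directory entry makes the module importable, which matches the `ImportError` shown above.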
### How does this PR fix it?
We shouldn't simply add the parent directory of each .py file, because other
files in that directory would then also become importable through
`PYTHONPATH` in client mode before context initialization.
Therefore, we copy the .py files into a temporary directory and add that
directory to `PYTHONPATH`.
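The idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the Scala code changed in PythonRunner: the function name `stage_py_files` and the temp-directory prefix are hypothetical. Plain .py files are copied into one fresh temporary directory that takes their place in the path list, while directories and archives pass through unchanged.

```python
import os
import shutil
import tempfile


def stage_py_files(py_files):
    """Map --py-files style entries to entries that are safe for PYTHONPATH.

    Hypothetical helper: .py files are copied into a single temp directory,
    so neighbours in their original directories are not exposed.
    """
    entries = []
    staging_dir = None
    for path in py_files:
        if path.endswith(".py"):
            if staging_dir is None:
                # Created lazily, only when at least one .py file is present.
                staging_dir = tempfile.mkdtemp(prefix="py_files_")
                entries.append(staging_dir)
            shutil.copy(path, staging_dir)
        else:
            # Directories, zips and eggs are already valid path entries.
            entries.append(path)
    return entries
```

For example, `os.pathsep.join(stage_py_files(["/home/spark/tmp.py"]))` would yield a single temporary directory containing only a copy of `tmp.py`.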
## How was this patch tested?
Unit tests are added, and the change was manually tested in both standalone
and yarn client modes with submit.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-24384
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21426.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21426
----
commit b76854dc58b4cd5c73933cff2b8b7d8e3ffb23ac
Author: hyukjinkwon <gurwls223@...>
Date: 2018-05-24T17:34:31Z
Add .py files correctly into PythonRunner in submit with client mode in
spark-submit
----