GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/21426

    [SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files correctly into 
PythonRunner in submit with client mode in spark-submit

    ## What changes were proposed in this pull request?
    
    In client mode, a .py file passed via `--py-files` cannot be imported before 
context initialization when the application is a Python file. See below:
    
    ```
    $ cat /home/spark/tmp.py
    def testtest():
        return 1
    ```
    
    This works:
    
    ```
    $ cat app.py
    import pyspark
    pyspark.sql.SparkSession.builder.getOrCreate()
    import tmp
    print("************************%s" % tmp.testtest())
    
    $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
/home/spark/tmp.py app.py
    ...
    ************************1
    ```
    
    but this doesn't:
    
    ```
    $ cat app.py
    import pyspark
    import tmp
    pyspark.sql.SparkSession.builder.getOrCreate()
    print("************************%s" % tmp.testtest())
    
    $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
/home/spark/tmp.py app.py
    Traceback (most recent call last):
      File "/home/spark/spark/app.py", line 2, in <module>
        import tmp
    ImportError: No module named tmp
    ```
    
    ### How did it happen?
    
    In client mode specifically, the paths are added to PythonRunner as they 
are:
    
    
https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L430
    
    
https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L49-L88
    
    The problem here is that a .py file shouldn't be added as is, since 
`PYTHONPATH` expects a directory or an archive like zip or egg.
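    
    The behaviour described above can be reproduced outside Spark. The following is a minimal sketch, not code from this PR: the module name `spark24384_demo_mod` and the helper `can_import` are hypothetical, and the file lives in a throwaway temp directory. It checks whether a module is importable when the `sys.path` entry is the .py file itself versus its containing directory.

```python
import importlib
import os
import sys
import tempfile


def can_import(module_name, path_entry):
    """Return True if module_name imports with path_entry prepended to sys.path."""
    sys.modules.pop(module_name, None)
    importlib.invalidate_caches()
    sys.path.insert(0, path_entry)
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        # Restore the interpreter state so the two checks are independent.
        sys.path.remove(path_entry)
        sys.modules.pop(module_name, None)


# Hypothetical module, mirroring tmp.py from the description.
workdir = tempfile.mkdtemp()
module_file = os.path.join(workdir, "spark24384_demo_mod.py")
with open(module_file, "w") as f:
    f.write("def testtest():\n    return 1\n")

file_entry_works = can_import("spark24384_demo_mod", module_file)  # path to the .py file
dir_entry_works = can_import("spark24384_demo_mod", workdir)       # path to its directory
print(file_entry_works, dir_entry_works)
```

    Only the directory entry makes the module importable, which matches the `ImportError` in the failing example.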
    
    ### How does this PR fix it?
    
    We shouldn't simply add the file's parent directory, because other files in 
that directory would then also end up on the `PYTHONPATH` in client mode 
before context initialization.
    
    Therefore, we copy the .py files into a dedicated temp directory and add 
that directory to `PYTHONPATH`.
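    
    The idea can be sketched in Python as follows. This is an illustration under assumptions, not the Scala change in this PR: the helper name `resolve_py_files` and the sample file names are hypothetical. Plain .py files are copied into one temp directory that becomes the path entry, while archives and directories pass through unchanged.

```python
import os
import shutil
import tempfile


def resolve_py_files(py_files, temp_dir):
    """Map --py-files style paths to PYTHONPATH entries (hypothetical helper).

    Existing .py files are copied into temp_dir and represented by temp_dir;
    any other path (zip, egg, directory) is kept as is.
    """
    entries = []
    for path in py_files:
        if path.endswith(".py") and os.path.isfile(path):
            shutil.copy(path, temp_dir)
            entry = temp_dir
        else:
            entry = path
        if entry not in entries:  # keep order, avoid duplicate entries
            entries.append(entry)
    return entries


# Demo: the source directory holds tmp.py plus an unrelated file.
source_dir = tempfile.mkdtemp()
for name in ("tmp.py", "unrelated.py"):
    with open(os.path.join(source_dir, name), "w") as f:
        f.write("def testtest():\n    return 1\n")

isolated_dir = tempfile.mkdtemp()
entries = resolve_py_files(
    [os.path.join(source_dir, "tmp.py"), "deps.zip"], isolated_dir
)
print(entries)
print(os.listdir(isolated_dir))
```

    Because only the requested file is copied, `unrelated.py` from the original parent directory never becomes importable.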
    
    ## How was this patch tested?
    
    Unit tests are added, and the change was manually tested in both standalone 
and yarn client modes with spark-submit.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-24384

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21426.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21426
    
----
commit b76854dc58b4cd5c73933cff2b8b7d8e3ffb23ac
Author: hyukjinkwon <gurwls223@...>
Date:   2018-05-24T17:34:31Z

    Add .py files correctly into PythonRunner in submit with client mode in 
spark-submit

----

