GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/21267

    [SPARK-21945][YARN][PYTHON] Make --py-files work in PySpark shell in Yarn 
client mode

    ## What changes were proposed in this pull request?
    
    ### Problem
    
    When we run the _PySpark shell in Yarn client mode_, files specified via `--py-files`
    are not recognised on the _driver side_.
    
    Here are the steps I took to check:
    
    ```bash
    $ cat /home/spark/tmp.py
    def testtest():
        return 1
    ```
    
    ```bash
    $ ./bin/pyspark --master yarn --deploy-mode client --py-files 
/home/spark/tmp.py
    ```
    
    ```python
    >>> def test():
    ...     import tmp
    ...     return tmp.testtest()
    ...
    >>> spark.range(1).rdd.map(lambda _: test()).collect()  # executor side
    [1]
    >>> test()  # driver side
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 2, in test
    ImportError: No module named tmp
    ```
    
    ### How does it happen?
    
    Unlike Yarn cluster mode, or Yarn client mode with _spark-submit_, when the PySpark
    shell runs in Yarn client mode specifically:
    
    1. it first runs the Python shell via:
    
    
https://github.com/apache/spark/blob/3cb82047f2f51af553df09b9323796af507d36f8/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L158
 as pointed out by @tgravescs in the JIRA.
    
    2. this triggers `shell.py`, which submits another application to launch a Py4J
    gateway:
    
    
https://github.com/apache/spark/blob/209b9361ac8a4410ff797cff1115e1888e2f7e66/python/pyspark/java_gateway.py#L45-L60
    
    3. it runs a Py4J gateway:
    
    
https://github.com/apache/spark/blob/3cb82047f2f51af553df09b9323796af507d36f8/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L425
    
    4. it copies the `--py-files` into a local temp directory:
    
    
https://github.com/apache/spark/blob/3cb82047f2f51af553df09b9323796af507d36f8/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L365-L376
    
    and then the resulting paths are set in `spark.submit.pyFiles`.
    
    5. the Py4J JVM is launched and then the Python paths are set via:
    
    
https://github.com/apache/spark/blob/7013eea11cb32b1e0038dc751c485da5c94a484b/python/pyspark/context.py#L209-L216
    
    However, these paths are not actually added, because the files were copied into a
    temp directory in step 4, whereas this code path looks under
    `SparkFiles.getRootDirectory()`, where files are stored only when
    `SparkContext.addFile()` is called (see the sketch below).
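    
    For reference, this is a simplified sketch of the idea behind those `context.py`
    lines (not the exact source), runnable inside the shell session above:
    
    ```python
    # The driver builds import paths under SparkFiles.getRootDirectory(); that root
    # directory is itself on sys.path, so imports only work for files that were
    # actually placed there (normally by SparkContext.addFile()).
    import os
    from pyspark import SparkFiles

    root_dir = SparkFiles.getRootDirectory()
    py_files = spark.sparkContext.getConf().get("spark.submit.pyFiles", "")
    for path in py_files.split(","):
        if path:
            expected = os.path.join(root_dir, os.path.basename(path))
            # In the PySpark shell + Yarn client case, step 4 only copied the file
            # into a different temp directory, so this prints False and `import tmp`
            # fails on the driver.
            print(expected, os.path.exists(expected))
    ```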
    
    In other cluster modes, `spark.files` is set via:
    
    
https://github.com/apache/spark/blob/3cb82047f2f51af553df09b9323796af507d36f8/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L554-L555
    
    and those files are explicitly added via:
    
    
https://github.com/apache/spark/blob/ecb8b383af1cf1b67f3111c148229e00c9c17c40/core/src/main/scala/org/apache/spark/SparkContext.scala#L395
    
    So we are fine in other modes.
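    
    As a rough illustration of why that works (not the Spark source; same hypothetical
    `/home/spark/tmp.py` as above), `addFile()` places the file under the SparkFiles
    root that step 5 relies on:
    
    ```python
    import os
    from pyspark import SparkFiles

    # Each spark.files entry goes through SparkContext.addFile(), which downloads the
    # file into SparkFiles.getRootDirectory() on the driver as well.
    spark.sparkContext.addFile("/home/spark/tmp.py")
    print(os.path.exists(os.path.join(SparkFiles.getRootDirectory(), "tmp.py")))  # True
    ```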
    
    In the case of Yarn cluster mode, and of Yarn client mode with _spark-submit_,
    these are handled manually. In particular, https://github.com/apache/spark/pull/6360
    added most of the logic. In that case, the Python path appears to be set manually
    via, for example, `deploy.PythonRunner`. We don't use `spark.files` here.
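    
    As a very rough Python illustration only (the real logic is Scala in
    `deploy.PythonRunner`, and the exact path handling differs), the effect is that the
    driver's Python process starts with the `--py-files` entries already merged into
    `PYTHONPATH`, so no lookup under SparkFiles is needed there:
    
    ```python
    import os
    import subprocess

    env = dict(os.environ)
    # Hypothetical --py-files entry; a .zip on PYTHONPATH is importable via zipimport.
    env["PYTHONPATH"] = os.pathsep.join(["/home/spark/tmp.zip", env.get("PYTHONPATH", "")])
    subprocess.Popen(["python", "app.py"], env=env)  # "app.py" is a placeholder script
    ```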
    
    ### How does the PR fix the problem?
    
    I tried to keep the approach as isolated as possible: simply copy the .py or .zip
    files into `SparkFiles.getRootDirectory()` on the driver side if they are not
    already there. Another possible way would be to set `spark.files`, but that does
    other unnecessary work as well and sounds a bit invasive.
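    
    A rough sketch of that idea (simplified; see the diff for the actual change):
    
    ```python
    import os
    import shutil
    from pyspark import SparkFiles

    root_dir = SparkFiles.getRootDirectory()
    py_files = spark.sparkContext.getConf().get("spark.submit.pyFiles", "")
    for path in py_files.split(","):
        if path:
            target = os.path.join(root_dir, os.path.basename(path))
            if not os.path.exists(target):
                # The PySpark shell in Yarn client mode never went through addFile()
                # for these files, so copy them into the SparkFiles root here.
                shutil.copyfile(path, target)
    ```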
    
    ### Before
    
    ```python
    >>> def test():
    ...     import tmp
    ...     return tmp.testtest()
    ...
    >>> spark.range(1).rdd.map(lambda _: test()).collect()
    [1]
    >>> test()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 2, in test
    ImportError: No module named tmp
    ```
    
    ### After
    
    ```python
    >>> def test():
    ...     import tmp
    ...     return tmp.testtest()
    ...
    >>> spark.range(1).rdd.map(lambda _: test()).collect()
    [1]
    >>> test()
    1
    ```
    
    ## How was this patch tested?
    
    I manually tested this in standalone and Yarn cluster modes with the PySpark shell.
    Both .zip and .py files were tested with steps similar to the above. It's difficult
    to add an automated test for this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-21945

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21267.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21267
    
----
commit 68be3baef22d8b7aa58a432cb5bd12437c07feb7
Author: hyukjinkwon <gurwls223@...>
Date:   2018-05-08T07:36:31Z

    Make --py-files work in PySpark shell in Yarn client mode

----

