GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/799

    [SPARK-1808] Route bin/pyspark through Spark submit

    **Problem.** For `bin/pyspark`, there is currently no way to specify 
Spark configuration properties other than through `SPARK_JAVA_OPTS` in 
`conf/spark-env.sh`. However, this mechanism is deprecated. Instead, 
`bin/pyspark` should pick up configurations explicitly specified in 
`conf/spark-defaults.conf`.
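
    For example, a user might set properties like the following in 
`conf/spark-defaults.conf` (the values here are purely illustrative):

        spark.master            spark://master:7077
        spark.executor.memory   2g
        spark.eventLog.enabled  true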
    
    **Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its 
counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This 
has the additional benefit of making the invocation of all the user-facing 
Spark scripts consistent.
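
    As a rough sketch (not the actual script), the routing in `bin/pyspark` 
might look like the following, assuming `SPARK_HOME` points at the Spark 
installation:

        # Hypothetical sketch: a Python file as the first argument goes
        # straight to spark-submit, which handles the rest.
        if [[ "$1" == *.py ]]; then
          exec "$SPARK_HOME"/bin/spark-submit "$@"
        fi
        # Otherwise, fall through and start the Python REPL as before.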
    
    **Details.** `bin/pyspark` inherently handles two cases: (1) running Python 
applications and (2) running the Python shell. For (1), Spark submit already 
handles running Python applications, so when `bin/pyspark` is given a Python 
file, we can simply pass the file directly to Spark submit and let it handle 
the rest.
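
    For instance, under this change the two invocations below should behave 
equivalently (`my_app.py` is a placeholder):

        $ bin/pyspark my_app.py
        $ bin/spark-submit my_app.py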
    
    For case (2), `bin/pyspark` starts a Python process as before, which 
launches the JVM as a sub-process. The existing code already provides a code 
path to do this; all we need to change is to use `bin/spark-submit` instead 
of `spark-class` to launch the JVM. This requires modifications to Spark 
submit to handle the pyspark shell as a special case.
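
    In shell terms, the REPL case changes roughly as follows (a sketch; it 
assumes the JVM is started from the Python process through a Py4J gateway, 
and that `pyspark-shell` is the token Spark submit uses to recognize this 
special case):

        # Before (sketch): the Python shell launched the JVM via spark-class.
        "$SPARK_HOME"/bin/spark-class py4j.GatewayServer --die-on-broken-pipe 0

        # After (sketch): the JVM goes through spark-submit instead, which
        # must now treat the pyspark shell as a special case.
        "$SPARK_HOME"/bin/spark-submit pyspark-shell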
    
    This has been tested locally on OS X for both cases, as well as with IPython.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark pyspark-submit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/799.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #799
    
----
commit a371d26ba770c781b86ed20d2922ab8fc043f52e
Author: Andrew Or <[email protected]>
Date:   2014-05-16T00:08:58Z

    Route bin/pyspark through Spark submit
    
    The bin/pyspark script takes two pathways, depending on the application.
    
    If the application is a Python file, bin/pyspark passes the file
    directly to Spark submit, which launches the Python application as a
    sub-process within the JVM.
    
    If the application is the pyspark shell, however, bin/pyspark starts
    the Python REPL as the parent process, which launches the JVM as a
    sub-process. A significant benefit here is that all keyboard signals
    are propagated to the Python interpreter first. The existing code
    already provides a code path to do this; all we need to change is to
    use spark-submit instead of spark-class to launch the JVM. This
    requires modifications to Spark submit to handle the pyspark shell
    as a special case.
    
    This has been tested locally on OS X for both cases, as well as with IPython.

----

