GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/799
[SPARK-1808] Route bin/pyspark through Spark submit
**Problem.** For `bin/pyspark`, there is currently no way to specify
Spark configuration properties other than through `SPARK_JAVA_OPTS` in
`conf/spark-env.sh`. However, that mechanism is supposedly deprecated. Instead,
`bin/pyspark` should pick up the configurations explicitly specified in
`conf/spark-defaults.conf`.
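For illustration, `conf/spark-defaults.conf` might contain entries like the
following (the property names are standard Spark settings; the values are made up):

```
spark.master            spark://master-host:7077
spark.executor.memory   2g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer
```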
**Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its
counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This
has the additional benefit of making the invocation of all user-facing
Spark scripts consistent.
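For example, after this change both of the following invocations go through
`bin/spark-submit`, and therefore both pick up `conf/spark-defaults.conf`
(the application shown is the stock pi example shipped with Spark):

```bash
# Interactive pyspark shell
bin/pyspark

# A standalone python application, forwarded to spark-submit
bin/pyspark examples/src/main/python/pi.py
```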
**Details.** `bin/pyspark` inherently handles two cases: (1) running python
applications and (2) running the python shell. For (1), Spark submit already
handles running python applications. For cases in which `bin/pyspark` is given
a python file, we can simply pass the file directly to Spark submit and
let it handle the rest.
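Roughly, that path reduces to forwarding the arguments to Spark submit. The
snippet below is only a sketch of the idea; the actual script's checks and
variable names may differ:

```bash
# Sketch only: if bin/pyspark is handed a python application, pass it
# (and any remaining arguments) straight to spark-submit.
# Assumes SPARK_HOME points at the Spark installation.
if [[ "$1" == *.py ]]; then
  exec "$SPARK_HOME"/bin/spark-submit "$@"
fi
```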
For case (2), `bin/pyspark` starts a python process as before, which
launches the JVM as a sub-process. The existing code already provides a code
path to do this. All we need to change is to use `bin/spark-submit` instead
of `spark-class` to launch the JVM. This requires modifications to Spark submit
to handle the pyspark shell as a special case.
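A rough sketch of the shell path (file and variable names here are
illustrative, not necessarily those of the actual scripts):

```bash
# Sketch only: Python stays the parent process, so keyboard signals reach
# the interpreter first. The startup file then launches the JVM lazily,
# now through spark-submit rather than spark-class.
export SPARK_HOME="${SPARK_HOME:-$(cd "$(dirname "$0")/.."; pwd)}"
export PYTHONSTARTUP="$SPARK_HOME/python/pyspark/shell.py"   # assumed location
exec "${PYSPARK_PYTHON:-python}"
```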
This has been tested locally (OSX) for both cases, and using IPython.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark pyspark-submit
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/799.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #799
----
commit a371d26ba770c781b86ed20d2922ab8fc043f52e
Author: Andrew Or <[email protected]>
Date: 2014-05-16T00:08:58Z
Route bin/pyspark through Spark submit
The bin/pyspark script takes two pathways, depending on the application.
If the application is a python file, bin/pyspark passes the python file
directly to Spark submit, which launches the python application as a
sub-process within the JVM.
If the application is the pyspark shell, however, bin/pyspark starts
the python REPL as the parent process, which launches the JVM as a
sub-process. A significant benefit here is that all keyboard signals
are propagated to the Python interpreter first. The existing
code already provided a code path to do this; all we need to change
is to use spark-submit instead of spark-class to launch the JVM. This
requires modifications to Spark submit to handle the pyspark shell
as a special case.
This has been tested locally (OSX) for both cases, and using IPython.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---