Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2651#issuecomment-57927210
Before the 1.2 release, maybe it's time to rethink how we run the pyspark shell
or scripts. Using bin/pyspark or spark-submit is not very friendly for users;
maybe we could simplify it.
Most of what bin/pyspark does is set up SPARK_HOME and PYTHONPATH, so I
did this in my .bashrc:
```
export SPARK_HOME=xxxx
export PYTHONPATH=${PYTHONPATH}:${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip
```
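For scripts that cannot rely on .bashrc, the same setup could be done from inside Python. This is only a rough sketch: the helper names `pyspark_paths` and `add_pyspark_to_path` are hypothetical (not part of pyspark), and the py4j zip name is copied from the export above, so it would change with the bundled py4j version.

```python
import os
import sys


def pyspark_paths(spark_home, py4j_zip="py4j-0.8.2.1-src.zip"):
    """Return the two entries that the exports above append to PYTHONPATH."""
    python_dir = os.path.join(spark_home, "python")
    return [python_dir, os.path.join(python_dir, "lib", py4j_zip)]


def add_pyspark_to_path(spark_home, path=None):
    """Append any missing PySpark entries to `path` (defaults to sys.path)."""
    if path is None:
        path = sys.path
    for entry in pyspark_paths(spark_home):
        if entry not in path:
            path.append(entry)
    return path
```

A script would call `add_pyspark_to_path(os.environ["SPARK_HOME"])` before `import pyspark`, assuming SPARK_HOME is already set.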
Then I can run any Python script that uses pyspark (most of them are
tests), and I can easily choose which Python to run it with, such as ipython:
```
ipython python/pyspark/tests.py
```
If the Python I use for the driver is not binary compatible with
the default one, then I need to set PYSPARK_PYTHON:
```
PYSPARK_PYTHON=pypy pypy rdd.py
```
I think we could find the correct version to use for the workers, so
PYSPARK_PYTHON is not needed in most cases. For example, if PYSPARK_PYTHON is
not set, we could default to the path of the Python used in the driver. We could
add some special cases, such as ipython: for an ipython that runs on python2.7,
the workers would use python2.7.
Also, we could set up SPARK_HOME and PYTHONPATH for the user when Spark is
installed.
bin/pyspark could become sparkshell.py, so users could easily run it with
whatever version of Python they choose.
bin/spark-submit is still useful for submitting jobs to a cluster or adding
files. At the same time, maybe we could introduce some default arguments for
general pyspark scripts, such as:
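The fallback could look roughly like the sketch below. The function name `worker_python` and the final `"python"` default are my own assumptions for illustration, not existing pyspark behavior; only the PYSPARK_PYTHON variable comes from the current scripts.

```python
import os
import sys


def worker_python(environ=None, driver_executable=None):
    """Pick the worker interpreter: explicit setting first, else the driver's."""
    if environ is None:
        environ = os.environ
    if driver_executable is None:
        driver_executable = sys.executable
    explicit = environ.get("PYSPARK_PYTHON")
    if explicit:
        return explicit
    # Assumed last resort when the driver's path is unknown (empty string).
    return driver_executable or "python"
```

As far as I know, `sys.executable` under ipython usually points at the underlying Python interpreter rather than the ipython launcher, so the ipython special case may come for free, but that would need checking.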
```bash
$ipython
daviesliu@dm:~/work/spark$ ipython wc.py -h
Usage: wc.py [options] [args]

Options:
  -h, --help            show this help message and exit
  -q, --quiet
  -v, --verbose

  PySpark Options:
    -m MASTER, --master=MASTER
    -p PARALLEL, --parallel=PARALLEL
                        number of processes
    -c CPUS, --cpus=CPUS
                        cpus used per task
    -M MEM, --mem=MEM   memory used per task
    --conf=CONF         path for configuration file
    --profile           do profiling
```
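Help output like the above could come from a shared option group built with the standard library's optparse. A minimal sketch, assuming a hypothetical `build_parser` helper; the option types (int for `--parallel` and `--cpus`, plain strings elsewhere) are guesses, and nothing here wires the parsed values into a SparkContext.

```python
from optparse import OptionGroup, OptionParser


def build_parser(prog=None):
    """Build a parser with script options plus a shared PySpark group."""
    parser = OptionParser(usage="%prog [options] [args]", prog=prog)
    parser.add_option("-q", "--quiet", action="store_true", default=False)
    parser.add_option("-v", "--verbose", action="store_true", default=False)

    # The group every pyspark script would get by default (assumed types).
    group = OptionGroup(parser, "PySpark Options")
    group.add_option("-m", "--master")
    group.add_option("-p", "--parallel", type="int",
                     help="number of processes")
    group.add_option("-c", "--cpus", type="int", help="cpus used per task")
    group.add_option("-M", "--mem", help="memory used per task")
    group.add_option("--conf", help="path for configuration file")
    group.add_option("--profile", action="store_true", default=False,
                     help="do profiling")
    parser.add_option_group(group)
    return parser
```

A script such as wc.py would then call `opts, args = build_parser().parse_args()` and read `opts.master`, `opts.parallel`, and so on.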