GitHub user davies commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-57927210
  
    Before the 1.2 release, maybe it's time to rethink how we run the pyspark 
shell or scripts: using bin/pyspark or spark-submit is not so friendly for 
users, and maybe we could simplify it.
    
    Most of what bin/pyspark does is set up SPARK_HOME and PYTHONPATH, so I 
put these in .bashrc:
    ```
    export SPARK_HOME=xxxx
    export PYTHONPATH=${PYTHONPATH}:${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip
    ``` 
    Then I can run any Python script that uses pyspark (most of mine are for 
testing), and I can easily choose which version of Python runs it, such as 
ipython:
    ```
    ipython python/pyspark/tests.py
    ```
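    
    To illustrate, with only those two variables set, a throwaway script like 
the following runs under any interpreter on the PATH (the word-count script 
here is a hypothetical sketch, not from the PR):
    ```python
    from pyspark import SparkContext

    # SPARK_HOME and PYTHONPATH from .bashrc make this import resolve;
    # no bin/pyspark wrapper is needed.
    sc = SparkContext("local", "wc")
    counts = sc.parallelize(["a", "b", "a"]).countByValue()
    print(counts)
    sc.stop()
    ```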
    If the version of Python I use for the driver is not binary compatible 
with the default one, then I need to set PYSPARK_PYTHON:
    ```
    PYSPARK_PYTHON=pypy pypy rdd.py
    ```
    I think we could figure out the correct version to use for the workers 
automatically, so PYSPARK_PYTHON would not be needed in most cases. For 
example, if PYSPARK_PYTHON is not set, we could default it to the path of 
the Python interpreter used by the driver. We could also handle special 
cases: for ipython, we could use the python2.7 that ipython itself runs on. 
A sketch of this fallback follows below.
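    
    A minimal sketch of that fallback, assuming it runs in the driver before 
workers are launched (the helper name worker_python is illustrative, not 
Spark's actual code):
    ```python
    import os
    import sys

    def worker_python():
        # An explicit PYSPARK_PYTHON setting always wins.
        explicit = os.environ.get("PYSPARK_PYTHON")
        if explicit:
            return explicit
        # Otherwise, default to the interpreter running the driver. Under
        # ipython, sys.executable is still the underlying python binary
        # (e.g. python2.7), so the special case is covered as well.
        return sys.executable
    ```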
    
    Also, we could set SPARK_HOME and PYTHONPATH for the user when Spark is 
installed.
    
    bin/pyspark could become a plain Python script, say sparkshell.py; then 
users could easily run it with whatever version of Python they choose.
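    
    For instance, a hypothetical sparkshell.py could be as small as this 
sketch (the banner text and the use of code.interact are my assumptions):
    ```python
    import code
    from pyspark import SparkContext

    # Create the context up front, then drop into an ordinary REPL with
    # `sc` predefined; python, ipython or pypy can all run this directly.
    sc = SparkContext(appName="PySparkShell")
    code.interact(banner="PySpark shell (`sc` is the SparkContext)",
                  local={"sc": sc})
    ```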
    
    bin/spark-submit is still useful for submitting jobs to a cluster or 
adding files. At the same time, maybe we could introduce some default 
command-line arguments for general pyspark scripts, such as:
    ```bash
    daviesliu@dm:~/work/spark$ ipython wc.py -h
    Usage: wc.py [options] [args]
    
    Options:
      -h, --help            show this help message and exit
      -q, --quiet
      -v, --verbose
    
      PySpark Options:
        -m MASTER, --master=MASTER
        -p PARALLEL, --parallel=PARALLEL
                            number of processes
        -c CPUS, --cpus=CPUS
                            cpus used per task
        -M MEM, --mem=MEM   memory used per task
        --conf=CONF         path for configuration file
        --profile           do profiling
    ```
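    
    A sketch of how a script like wc.py could expose that "PySpark Options" 
group; the Usage/Options layout above looks like optparse output, so this 
uses optparse, and the defaults are my guesses:
    ```python
    from optparse import OptionGroup, OptionParser

    parser = OptionParser(usage="%prog [options] [args]")
    parser.add_option("-q", "--quiet", action="store_true")
    parser.add_option("-v", "--verbose", action="store_true")

    # Group the Spark-specific flags so they print under their own
    # heading in the -h output, as shown above.
    group = OptionGroup(parser, "PySpark Options")
    group.add_option("-m", "--master", default="local")
    group.add_option("-p", "--parallel", type="int",
                     help="number of processes")
    group.add_option("-c", "--cpus", type="int", help="cpus used per task")
    group.add_option("-M", "--mem", help="memory used per task")
    group.add_option("--conf", help="path for configuration file")
    group.add_option("--profile", action="store_true", help="do profiling")
    parser.add_option_group(group)

    options, args = parser.parse_args()
    ```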


