Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r193672797

--- Diff: docs/submitting-applications.md ---
@@ -218,6 +218,115 @@ These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to
 For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries to executors.
+
+# VirtualEnv for PySpark (this is an experimental feature and may evolve in future versions)
+For a simple PySpark application, you can use `--py-files` to add its dependencies. A large PySpark application, however,
+usually has many dependencies, which may in turn have transitive dependencies, and some of them may need to be compiled
+before they can be installed. In such cases `--py-files` is not very convenient. Fortunately, the Python world has virtualenv and conda to help create isolated
+Python environments. PySpark also supports virtualenv (only in YARN mode for now). Users can use this feature
+in two scenarios:
+* Batch mode (submit the Spark application via spark-submit)
+* Interactive mode (PySpark shell or other third-party Spark notebooks)
+
+## Prerequisites
+- Each node has virtualenv/conda and python-devel installed
+- Each node can access the internet (to download packages)
+
+## Batch Mode
+
+In batch mode, users need to specify the additional Python packages before launching the Spark application. There are two ways to do this:
+* Provide a requirements file that contains all the packages for the virtualenv.
+* Specify packages via the Spark configuration `spark.pyspark.virtualenv.packages`.
+
+Here are several examples:
+
+{% highlight bash %}
+### Set up a virtualenv using native virtualenv in yarn-client mode
+bin/spark-submit \
+  --master yarn \
+  --deploy-mode client \
+  --conf "spark.pyspark.virtualenv.enabled=true" \
+  --conf "spark.pyspark.virtualenv.type=native" \
+  --conf "spark.pyspark.virtualenv.requirements=<local_requirement_file>" \
+  --conf "spark.pyspark.virtualenv.bin.path=<virtualenv_bin_path>" \
+  <pyspark_script>
+
+### Set up a virtualenv using conda in yarn-client mode
+bin/spark-submit \
+  --master yarn \
+  --deploy-mode client \
+  --conf "spark.pyspark.virtualenv.enabled=true" \
+  --conf "spark.pyspark.virtualenv.type=conda" \
+  --conf "spark.pyspark.virtualenv.requirements=<local_requirement_file>" \
+  --conf "spark.pyspark.virtualenv.bin.path=<conda_bin_path>" \
+  <pyspark_script>
+
+### Set up a virtualenv using conda in yarn-client mode, specifying packages via `spark.pyspark.virtualenv.packages`
+bin/spark-submit \
+  --master yarn \
+  --deploy-mode client \
+  --conf "spark.pyspark.virtualenv.enabled=true" \
+  --conf "spark.pyspark.virtualenv.type=conda" \
+  --conf "spark.pyspark.virtualenv.packages=numpy,pandas" \
+  --conf "spark.pyspark.virtualenv.bin.path=<conda_bin_path>" \
+  <pyspark_script>
+{% endhighlight %}
+
+### How to create a requirements file
+Usually, before running a distributed PySpark job, you first run it in a local environment. It is encouraged to create your own virtualenv for your project first, so that you know which packages you need. Once you are confident with your work and want to move it to a cluster, you can run one of the following commands to generate the requirements file for virtualenv or conda:
+- pip freeze > requirements.txt
+- conda list --export > requirements.txt
+
+## Interactive Mode
+In interactive mode, users can install Python packages at runtime instead of specifying them in a requirements file when submitting the Spark application.
+Here are several ways to install packages:
+
+{% highlight python %}
+sc.install_packages("numpy")  # install the latest numpy
--- End diff --

Seems there are tabs here. Shall we replace them with spaces?