Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r160091411

    --- Diff: docs/submitting-applications.md ---
    @@ -218,6 +218,73 @@ These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to
     For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries to executors.
     
    +# VirtualEnv for Pyspark
    +For simple PySpark application, we can use `--py-files` to add its dependencies. While for a large PySpark application,
    +usually you will have many dependencies which may also have transitive dependencies and even some dependencies need to be compiled
    +to be installed. In this case `--py-files` is not so convenient. Luckily, in python world we have virtualenv/conda to help create isolated
    +python work environment. We also implement virtualenv in PySpark (It is only supported in yarn mode for now).
    +
    +# Prerequisites
    +- Each node have virtualenv/conda, python-devel installed
    +- Each node is internet accessible (for downloading packages)
    +
    +{% highlight bash %}
    +# Setup virtualenv using native virtualenv on yarn-client mode
    +bin/spark-submit \
    +    --master yarn \
    +    --deploy-mode client \
    +    --conf "spark.pyspark.virtualenv.enabled=true" \
    +    --conf "spark.pyspark.virtualenv.type=native" \
    +    --conf "spark.pyspark.virtualenv.requirements=<local_requirement_file>" \
    +    --conf "spark.pyspark.virtualenv.bin.path=<virtualenv_bin_path>" \
    +    <pyspark_script>
    +
    +# Setup virtualenv using conda on yarn-client mode
    +bin/spark-submit \
    +    --master yarn \
    +    --deploy-mode client \
    +    --conf "spark.pyspark.virtualenv.enabled=true" \
    +    --conf "spark.pyspark.virtualenv.type=conda" \
    +    --conf "spark.pyspark.virtualenv.requirements=<local_requirement_file>" \
    +    --conf "spark.pyspark.virtualenv.bin.path=<conda_bin_path>" \
    +    <pyspark_script>
    +{% endhighlight %}
    +
    +## PySpark VirtualEnv Configurations
    +<table class="table">
    +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +<tr>
    +  <td><code>spark.pyspark.virtualenv.enabled</code></td>
    +  <td>false</td>
    +  <td>Whether to enable virtualenv</td>
    +</tr>
    +<tr>
    +  <td><code>spark.pyspark.virtualenv.type</code></td>
    +  <td>virtualenv</td>
    --- End diff --
    
    `native` instead of `virtualenv`? Btw, should we use `native` for the config value to indicate virtualenv? I'd prefer `virtualenv` instead.
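As background for the doc change under review: the `spark.pyspark.virtualenv.*` settings ask each node to build an isolated Python environment before running user code. A minimal local sketch of that idea using only Python's standard-library `venv` module (illustrative only, not Spark's actual implementation; the directory name is made up):

```python
import os
import tempfile
import venv

# Create an isolated environment in a throwaway directory, roughly what a
# "native" virtualenv setup would do on each node (illustrative only).
env_dir = os.path.join(tempfile.mkdtemp(), "pyspark_env")
venv.create(env_dir, with_pip=False)  # with_pip=True would also bootstrap pip

# venv marks the directory as a virtual environment with a pyvenv.cfg file.
print(os.path.exists(os.path.join(env_dir, "pyvenv.cfg")))  # prints True
```

In the proposed feature, the requirements file named by `spark.pyspark.virtualenv.requirements` would then be installed into such an environment on each node, which is why the prerequisites above call for internet access.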
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org