Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r193672797

--- Diff: docs/submitting-applications.md ---
@@ -218,6 +218,115 @@ These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to
 For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries to executors.
+
+# VirtualEnv for PySpark (this is an experimental feature and may evolve in future versions)
+For a simple PySpark application, you can use `--py-files` to add its dependencies. A large PySpark application, however,
+usually has many dependencies, which may in turn have transitive dependencies, and some of them may need to be compiled
+before they can be installed. In such cases `--py-files` is not very convenient. Fortunately, the Python world has virtualenv and conda to help create isolated
+Python environments. PySpark also supports virtualenv (only in YARN mode for now). Users can use this feature
+in two scenarios:
+* Batch mode (submit the Spark application via spark-submit)
+* Interactive mode (PySpark shell or other third-party Spark notebooks)
+
+## Prerequisites
+- Each node has virtualenv/conda and python-devel installed
+- Each node can access the internet (to download packages)
+
+## Batch Mode
+
+In batch mode, users need to specify the additional Python packages before launching the Spark application. There are two ways to do this:
+* Provide a requirements file that contains all the packages for the virtualenv.
+* Specify packages via the Spark configuration `spark.pyspark.virtualenv.packages`.
+
+Here are several examples:
+
+{% highlight bash %}
+### Set up a virtualenv using native virtualenv in yarn-client mode
+bin/spark-submit \
+  --master yarn \
+  --deploy-mode client \
+  --conf "spark.pyspark.virtualenv.enabled=true" \
+  --conf "spark.pyspark.virtualenv.type=native" \
+  --conf "spark.pyspark.virtualenv.requirements=<local_requirement_file>" \
+  --conf "spark.pyspark.virtualenv.bin.path=<virtualenv_bin_path>" \
+  <pyspark_script>
+
+### Set up a virtualenv using conda in yarn-client mode
+bin/spark-submit \
+  --master yarn \
+  --deploy-mode client \
+  --conf "spark.pyspark.virtualenv.enabled=true" \
+  --conf "spark.pyspark.virtualenv.type=conda" \
+  --conf "spark.pyspark.virtualenv.requirements=<local_requirement_file>" \
+  --conf "spark.pyspark.virtualenv.bin.path=<conda_bin_path>" \
+  <pyspark_script>
+
+### Set up a virtualenv using conda in yarn-client mode, specifying packages via `spark.pyspark.virtualenv.packages`
+bin/spark-submit \
+  --master yarn \
+  --deploy-mode client \
+  --conf "spark.pyspark.virtualenv.enabled=true" \
+  --conf "spark.pyspark.virtualenv.type=conda" \
+  --conf "spark.pyspark.virtualenv.packages=numpy,pandas" \
+  --conf "spark.pyspark.virtualenv.bin.path=<conda_bin_path>" \
+  <pyspark_script>
+{% endhighlight %}
+
+### How to create a requirements file
+Usually, before running a distributed PySpark job, you first run it in a local environment. It is encouraged to create your own virtualenv for your project first, so that you know which packages you need. Once you are confident with your work and want to move it to a cluster, you can run one of the following commands to generate the requirements file for virtualenv or conda:
+- pip freeze > requirements.txt
+- conda list --export > requirements.txt
+
+## Interactive Mode
+In interactive mode, users can install Python packages at runtime instead of specifying them in a requirements file when submitting the Spark application.
+Here are several ways to install packages:
+
+{% highlight python %}
+sc.install_packages("numpy")  # install the latest numpy
--- End diff --

Seems there are tabs here. Shall we replace them with spaces?