Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r160091411
  
    --- Diff: docs/submitting-applications.md ---
    @@ -218,6 +218,73 @@ These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to
     For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries
     to executors.
     
    +# VirtualEnv for PySpark
    +For a simple PySpark application, `--py-files` can be used to add its dependencies. A large PySpark application,
    +however, usually has many dependencies, which may in turn have transitive dependencies, and some of them may even need to be
    +compiled before they can be installed. In such cases `--py-files` is not very convenient. Luckily, the Python world has
    +virtualenv/conda to help create an isolated Python working environment. PySpark also supports virtualenv (only in yarn mode for now).
    +
    +# Prerequisites
    +- Each node has virtualenv/conda and python-devel installed
    +- Each node has internet access (for downloading packages)
    +
    +{% highlight bash %}
    +# Set up virtualenv using native virtualenv in yarn-client mode
    +bin/spark-submit \
    +    --master yarn \
    +    --deploy-mode client \
    +    --conf "spark.pyspark.virtualenv.enabled=true" \
    +    --conf "spark.pyspark.virtualenv.type=native" \
    +    --conf "spark.pyspark.virtualenv.requirements=<local_requirement_file>" \
    +    --conf "spark.pyspark.virtualenv.bin.path=<virtualenv_bin_path>" \
    +    <pyspark_script>
    +
    +# Set up virtualenv using conda in yarn-client mode
    +bin/spark-submit \
    +    --master yarn \
    +    --deploy-mode client \
    +    --conf "spark.pyspark.virtualenv.enabled=true" \
    +    --conf "spark.pyspark.virtualenv.type=conda" \
    +    --conf "spark.pyspark.virtualenv.requirements=<local_requirement_file>" \
    +    --conf "spark.pyspark.virtualenv.bin.path=<conda_bin_path>" \
    +    <pyspark_script>
    +{% endhighlight %}
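    +
    +For illustration only, the following is a minimal sketch of what `<pyspark_script>` could look like.
    +It assumes the requirements file lists `numpy` (the script name and the package choice are just examples);
    +it verifies that a package installed into the virtualenv is importable on the executors:
    +
    +{% highlight python %}
    +# virtualenv_check.py -- hypothetical example script, not part of the PySpark API.
    +# Assumes the file passed via spark.pyspark.virtualenv.requirements contains a line
    +# such as:  numpy==1.14.0
    +from pyspark import SparkContext
    +
    +sc = SparkContext(appName="virtualenv-check")
    +
    +def numpy_version(_):
    +    # Runs on an executor; succeeds only if numpy was installed into its virtualenv.
    +    import numpy
    +    return numpy.__version__
    +
    +# Collect the numpy version reported from each partition.
    +print(sc.parallelize(range(2), 2).map(numpy_version).collect())
    +
    +sc.stop()
    +{% endhighlight %}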
    +
    +## PySpark VirtualEnv Configurations
    +<table class="table">
    +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +<tr>
    +  <td><code>spark.pyspark.virtualenv.enabled</code></td>
    +  <td>false</td>
    +  <td>Whether to enable virtualenv support in PySpark.</td>
    +</tr>
    +<tr>
    +  <td><code>spark.pyspark.virtualenv.type</code></td>
    +  <td>virtualenv</td>
    --- End diff --
    
    `native` instead of `virtualenv`?
    
    Btw, should we use `native` for the config value to indicate virtualenv? 
I'd prefer `virtualenv` instead.
      

