Repository: spark
Updated Branches:
  refs/heads/branch-2.2 88dccda39 -> da403b953
[SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation.

Update the Quickstart and RDD programming guides to mention pip.

Built docs locally.

Author: Holden Karau <[email protected]>

Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.

(cherry picked from commit cc00e99d5396893b2d3d50960161080837cf950a)
Signed-off-by: Holden Karau <[email protected]>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/da403b95
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/da403b95
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/da403b95

Branch: refs/heads/branch-2.2
Commit: da403b95353f064c24da25236fa7f905fa8ddca1
Parents: 88dccda
Author: Holden Karau <[email protected]>
Authored: Fri Jul 21 16:50:47 2017 -0700
Committer: Holden Karau <[email protected]>
Committed: Fri Jul 21 16:53:39 2017 -0700

----------------------------------------------------------------------
 docs/quick-start.md           | 27 ++++++++++++++++++++++++++-
 docs/rdd-programming-guide.md | 15 ++++++++++++---
 2 files changed, 38 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/da403b95/docs/quick-start.md
----------------------------------------------------------------------
diff --git a/docs/quick-start.md b/docs/quick-start.md
index b88ae5f..cb5211a 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -66,6 +66,11 @@ res3: Long = 15
 
     ./bin/pyspark
 
+Or if PySpark is installed with pip in your current environment:
+
+    pyspark
+
+
 Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let's make a new DataFrame from the text of the README file in the Spark source directory:
 
 {% highlight python %}
@@ -206,7 +211,7 @@ a cluster, as described in the [RDD programming guide](rdd-programming-guide.htm
 # Self-Contained Applications
 
 Suppose we wish to write a self-contained application using the Spark API. We will walk through a
-simple application in Scala (with sbt), Java (with Maven), and Python.
+simple application in Scala (with sbt), Java (with Maven), and Python (pip).
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -367,6 +372,16 @@ Lines with a: 46, Lines with b: 23
 
 Now we will show how to write an application using the Python API (PySpark).
 
+
+If you are building a packaged PySpark application or library you can add it to your setup.py file as:
+
+{% highlight python %}
+    install_requires=[
+        'pyspark=={{site.SPARK_VERSION}}'
+    ]
+{% endhighlight %}
+
+
 As an example, we'll create a simple Spark application, `SimpleApp.py`:
 
 {% highlight python %}
@@ -406,6 +421,16 @@ $ YOUR_SPARK_HOME/bin/spark-submit \
 Lines with a: 46, Lines with b: 23
 {% endhighlight %}
 
+If you have PySpark pip installed into your environment (e.g. `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided spark-submit as you prefer.
+
+{% highlight bash %}
+# Use the Python interpreter to run your application
+$ python SimpleApp.py
+...
+Lines with a: 46, Lines with b: 23
+{% endhighlight %}
+
+
 </div>
 </div>
 
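For context on the `install_requires` pin the quick-start hunk adds: a minimal setup.py for a packaged PySpark application might look like the sketch below. The package name `simpleapp` and module layout are hypothetical, and the hard-coded version stands in for whatever `{{site.SPARK_VERSION}}` renders to on this branch (2.2.x):

    # setup.py -- packaging sketch for a PySpark application (hypothetical names)
    from setuptools import setup

    setup(
        name='simpleapp',            # hypothetical package name
        version='0.1.0',
        py_modules=['SimpleApp'],    # ships SimpleApp.py as a top-level module
        install_requires=[
            'pyspark==2.2.0'         # pin PySpark to the targeted Spark release
        ],
    )

With a file like this in place, `pip install .` installs PySpark alongside the application itself, so no separate Spark download is needed for local runs.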
http://git-wip-us.apache.org/repos/asf/spark/blob/da403b95/docs/rdd-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md
index d021b73..8e6c36b 100644
--- a/docs/rdd-programming-guide.md
+++ b/docs/rdd-programming-guide.md
@@ -86,12 +86,21 @@ import org.apache.spark.SparkConf;
 
 <div data-lang="python" markdown="1">
 
-Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter,
+Spark {{site.SPARK_VERSION}} works with Python 2.7+ or Python 3.4+. It can use the standard CPython interpreter,
 so C libraries like NumPy can be used. It also works with PyPy 2.3+.
 
-Note that support for Python 2.6 is deprecated as of Spark 2.0.0, and may be removed in Spark 2.2.0.
+Python 2.6 support was removed in Spark 2.2.0.
 
-To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory.
+Spark applications in Python can either be run with the `bin/spark-submit` script which includes Spark at runtime, or by including it in your setup.py as:
+
+{% highlight python %}
+    install_requires=[
+        'pyspark=={{site.SPARK_VERSION}}'
+    ]
+{% endhighlight %}
+
+
+To run Spark applications in Python without pip installing PySpark, use the `bin/spark-submit` script located in the Spark directory.
 This script will load Spark's Java/Scala libraries and allow you to submit applications to a
 cluster. You can also use `bin/pyspark` to launch an interactive Python shell.
 
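To illustrate the workflow both guides describe, here is a minimal sketch of an application that runs under the plain Python interpreter once `pip install pyspark` has completed; the file name, app name, and `local[2]` master are illustrative choices, not taken from the commit:

    # wordcount_sketch.py -- run with: python wordcount_sketch.py
    from pyspark import SparkConf, SparkContext

    # A local master keeps the example self-contained; cluster deployments
    # would instead pass the master via spark-submit or configuration.
    conf = SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    # Count occurrences of each word in a tiny in-memory dataset.
    counts = (sc.parallelize(["spark", "pip", "spark"])
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.collect())
    sc.stop()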
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]