Repository: spark
Updated Branches:
  refs/heads/branch-2.2 88dccda39 -> da403b953
[SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation.

Update the Quickstart and RDD programming guides to mention pip.

Built docs locally.

Author: Holden Karau <[email protected]>

Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.

(cherry picked from commit cc00e99d5396893b2d3d50960161080837cf950a)
Signed-off-by: Holden Karau <[email protected]>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/da403b95
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/da403b95
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/da403b95

Branch: refs/heads/branch-2.2
Commit: da403b95353f064c24da25236fa7f905fa8ddca1
Parents: 88dccda
Author: Holden Karau <[email protected]>
Authored: Fri Jul 21 16:50:47 2017 -0700
Committer: Holden Karau <[email protected]>
Committed: Fri Jul 21 16:53:39 2017 -0700

----------------------------------------------------------------------
 docs/quick-start.md           | 27 ++++++++++++++++++++++++++-
 docs/rdd-programming-guide.md | 15 ++++++++++++---
 2 files changed, 38 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/da403b95/docs/quick-start.md
----------------------------------------------------------------------
diff --git a/docs/quick-start.md b/docs/quick-start.md
index b88ae5f..cb5211a 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -66,6 +66,11 @@ res3: Long = 15
 
     ./bin/pyspark
 
+Or if PySpark is installed with pip in your current environment:
+
+    pyspark
+
+
 Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let's make a new DataFrame from the text of the README file in the Spark source directory:
 
 {% highlight python %}
@@ -206,7 +211,7 @@ a cluster, as described in the [RDD programming guide](rdd-programming-guide.htm
 # Self-Contained Applications
 
 Suppose we wish to write a self-contained application using the Spark API. We will walk through a
-simple application in Scala (with sbt), Java (with Maven), and Python.
+simple application in Scala (with sbt), Java (with Maven), and Python (pip).
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -367,6 +372,16 @@ Lines with a: 46, Lines with b: 23
 
 Now we will show how to write an application using the Python API (PySpark).
 
+
+If you are building a packaged PySpark application or library you can add it to your setup.py file as:
+
+{% highlight python %}
+    install_requires=[
+        'pyspark=={{site.SPARK_VERSION}}'
+    ]
+{% endhighlight %}
+
+
 As an example, we'll create a simple Spark application, `SimpleApp.py`:
 
 {% highlight python %}
@@ -406,6 +421,16 @@ $ YOUR_SPARK_HOME/bin/spark-submit \
 Lines with a: 46, Lines with b: 23
 {% endhighlight %}
 
+If you have PySpark pip installed into your environment (e.g. `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided spark-submit as you prefer.
+
+{% highlight bash %}
+# Use the Python interpreter to run your application
+$ python SimpleApp.py
+...
+Lines with a: 46, Lines with b: 23
+{% endhighlight %}
+
+
 </div>
 </div>
 
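For context on the `install_requires` pin the quick-start hunk adds: a minimal setup.py for a packaged PySpark application might look like the sketch below. The package name `simpleapp` and module layout are hypothetical, and the hard-coded version stands in for whatever `{{site.SPARK_VERSION}}` renders to on this branch (2.2.x):

    # setup.py -- packaging sketch for a PySpark application (hypothetical names)
    from setuptools import setup

    setup(
        name='simpleapp',            # hypothetical package name
        version='0.1.0',
        py_modules=['SimpleApp'],    # ships SimpleApp.py as a top-level module
        install_requires=[
            'pyspark==2.2.0'         # pin PySpark to the targeted Spark release
        ],
    )

With a file like this in place, `pip install .` installs PySpark alongside the application itself, so no separate Spark download is needed for local runs.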
http://git-wip-us.apache.org/repos/asf/spark/blob/da403b95/docs/rdd-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md
index d021b73..8e6c36b 100644
--- a/docs/rdd-programming-guide.md
+++ b/docs/rdd-programming-guide.md
@@ -86,12 +86,21 @@ import org.apache.spark.SparkConf;
 
 <div data-lang="python" markdown="1">
 
-Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter,
+Spark {{site.SPARK_VERSION}} works with Python 2.7+ or Python 3.4+. It can use the standard CPython interpreter,
 so C libraries like NumPy can be used. It also works with PyPy 2.3+.
 
-Note that support for Python 2.6 is deprecated as of Spark 2.0.0, and may be removed in Spark 2.2.0.
+Python 2.6 support was removed in Spark 2.2.0.
 
-To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory.
+Spark applications in Python can either be run with the `bin/spark-submit` script which includes Spark at runtime, or by including it in your setup.py as:
+
+{% highlight python %}
+    install_requires=[
+        'pyspark=={{site.SPARK_VERSION}}'
+    ]
+{% endhighlight %}
+
+
+To run Spark applications in Python without pip installing PySpark, use the `bin/spark-submit` script located in the Spark directory.
 This script will load Spark's Java/Scala libraries and allow you to submit applications to a
 cluster. You can also use `bin/pyspark` to launch an interactive Python shell.
 
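To illustrate the workflow both guides describe, here is a minimal sketch of an application that runs under the plain Python interpreter once `pip install pyspark` has completed; the file name, app name, and `local[2]` master are illustrative choices, not taken from the commit:

    # wordcount_sketch.py -- run with: python wordcount_sketch.py
    from pyspark import SparkConf, SparkContext

    # A local master keeps the example self-contained; cluster deployments
    # would instead pass the master via spark-submit or configuration.
    conf = SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    # Count occurrences of each word in a tiny in-memory dataset.
    counts = (sc.parallelize(["spark", "pip", "spark"])
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.collect())
    sc.stop()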
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]