[GitHub] [spark] HyukjinKwon commented on a change in pull request #29491: [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation

GitBox Fri, 21 Aug 2020 00:06:54 -0700


HyukjinKwon commented on a change in pull request #29491:
URL: https://github.com/apache/spark/pull/29491#discussion_r474451282




##########
File path: python/docs/source/getting_started/quickstart.ipynb
##########
@@ -0,0 +1,1091 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Quickstart\n",
+    "\n",
+    "This is a short introduction and quickstart for PySpark DataFrame. 
PySpark DataFrame is lazily evaluated and implemented on top of 
[RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview).
 When the data is 
[transformed](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations),
 it does not actually compute but plans how to compute later. When the 
[actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions)
 such as `collect()` are explicitly called, the computation starts.\n",
+    "This notebook shows the basic usages of the DataFrame, geared mainly for 
new users. You can run the latest version of these examples by yourself on a 
live notebook 
[here](https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).\n",
+    "\n",
+    "There are also other useful information in Apache Spark documentation 
site, see the latest version of [Spark SQL and 
DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html), 
[RDD Programming 
Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html), 
[Structured Streaming Programming 
Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html),
 [Spark Streaming Programming 
Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html) 
and [Machine Learning Library (MLlib) 
Guide](https://spark.apache.org/docs/latest/ml-guide.html).\n",
+    "\n",
+    "Usually PySpark applications start with initializing `SparkSession` which 
is the entry point of PySpark as below. In case of running it in PySpark shell 
via <code>pyspark</code> executable, the shell automatically creates the 
session in the variable <code>spark</code> for users."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyspark.sql import SparkSession\n",
+    "\n",
+    "spark = SparkSession.builder.getOrCreate()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## DataFrame Creation\n",
+    "\n",
+    "A PySpark DataFrame can be created via 
`pyspark.sql.SparkSession.createDataFrame` typically by passing a list of 
lists, tuples, dictionaries and `pyspark.sql.Row`s, a [pandas 
DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) 
and an RDD consisting of such a list.\n",

Review comment:
       Yeah, I ran into this one as well. I already dug through the history, and 
it seems it was deprecated when the `Row` APIs came out. I think we should 
undeprecate it.
   
   @nchammas, are you interested in submitting a PR to remove these warnings? 
Manual tests should be good enough.




