HyukjinKwon commented on a change in pull request #29491:
URL: https://github.com/apache/spark/pull/29491#discussion_r476239875
##########
File path: python/docs/source/getting_started/quickstart.ipynb
##########
@@ -0,0 +1,1177 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Quickstart\n",
+ "\n",
+ "This is a short introduction and quickstart for the PySpark DataFrame
API. PySpark DataFrames are lazily evaluated. They are implemented on top of
[RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview)s.
When Spark
[transforms](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)
data, it does not immediately compute the transformation but plans how to
compute later. When
[actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions)
such as `collect()` are explicitly called, the computation starts.\n",
+ "This notebook shows the basic usage of the DataFrame API and is geared mainly
toward new users. You can run the latest version of these examples yourself on a
live notebook
[here](https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).\n",
+ "\n",
+ "There is also other useful information on the Apache Spark documentation
site; see the latest versions of [Spark SQL and
DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html),
[RDD Programming
Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html),
[Structured Streaming Programming
Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html),
[Spark Streaming Programming
Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
and [Machine Learning Library (MLlib)
Guide](https://spark.apache.org/docs/latest/ml-guide.html).\n",
+ "\n",
+ "PySpark applications usually start by initializing a `SparkSession`, which
is the entry point of PySpark, as shown below. When running in the PySpark shell
via the <code>pyspark</code> executable, the shell automatically creates the
session in the variable <code>spark</code> for users."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pyspark.sql import SparkSession\n",
+ "\n",
+ "spark = SparkSession.builder.getOrCreate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrame Creation\n",
+ "\n",
+ "A PySpark DataFrame can be created via
`pyspark.sql.SparkSession.createDataFrame`, typically by passing a list of
lists, tuples, dictionaries, or `pyspark.sql.Row`s, a [pandas
DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html),
or an RDD consisting of such a list.\n",
+ "`pyspark.sql.SparkSession.createDataFrame` takes the `schema` argument to
specify the schema of the DataFrame. When it is omitted, PySpark infers the
corresponding schema by taking a sample from the data.\n",
+ "\n",
+ "The example below creates a PySpark DataFrame from a list of rows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import datetime\n",
+ "import pandas as pd\n",
+ "from pyspark.sql import Row\n",
+ "\n",
+ "df = spark.createDataFrame([\n",
+ " Row(a=1, b=2., c='string1', d=datetime.date(2000, 1, 1),
e=datetime.datetime(2000, 1, 1, 12, 0)),\n",
+ " Row(a=2, b=3., c='string2', d=datetime.date(2000, 2, 1),
e=datetime.datetime(2000, 1, 2, 12, 0)),\n",
+ " Row(a=3, b=4., c='string3', d=datetime.date(2000, 3, 1),
e=datetime.datetime(2000, 1, 3, 12, 0))\n",
+ "])\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Create a PySpark DataFrame with an explicit schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = spark.createDataFrame([\n",
+ " (1, 2., 'string1', datetime.date(2000, 1, 1), datetime.datetime(2000,
1, 1, 12, 0)),\n",
+ " (2, 3., 'string2', datetime.date(2000, 2, 1), datetime.datetime(2000,
1, 2, 12, 0)),\n",
+ " (3, 4., 'string3', datetime.date(2000, 3, 1), datetime.datetime(2000,
1, 3, 12, 0))\n",
+ "], schema='a long, b double, c string, d date, e timestamp')\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Create a PySpark DataFrame from a pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pandas_df = pd.DataFrame({\n",
+ " 'a': [1, 2, 3],\n",
+ " 'b': [2., 3., 4.],\n",
+ " 'c': ['string1', 'string2', 'string3'],\n",
+ " 'd': [datetime.date(2000, 1, 1), datetime.date(2000, 2, 1),
datetime.date(2000, 3, 1)],\n",
+ " 'e': [datetime.datetime(2000, 1, 1, 12, 0), datetime.datetime(2000,
1, 2, 12, 0), datetime.datetime(2000, 1, 3, 12, 0)]\n",
+ "})\n",
+ "df = spark.createDataFrame(pandas_df)\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Create a PySpark DataFrame from an RDD consisting of a list of tuples."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rdd = spark.sparkContext.parallelize([\n",
+ " (1, 2., 'string1', datetime.date(2000, 1, 1), datetime.datetime(2000,
1, 1, 12, 0)),\n",
+ " (2, 3., 'string2', datetime.date(2000, 2, 1), datetime.datetime(2000,
1, 2, 12, 0)),\n",
+ " (3, 4., 'string3', datetime.date(2000, 3, 1), datetime.datetime(2000,
1, 3, 12, 0))\n",
+ "])\n",
+ "df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The DataFrames created above all have the same results and schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+---+---+-------+----------+-------------------+\n",
+ "| a| b| c| d| e|\n",
+ "+---+---+-------+----------+-------------------+\n",
+ "| 1|2.0|string1|2000-01-01|2000-01-01 12:00:00|\n",
+ "| 2|3.0|string2|2000-02-01|2000-01-02 12:00:00|\n",
+ "| 3|4.0|string3|2000-03-01|2000-01-03 12:00:00|\n",
+ "+---+---+-------+----------+-------------------+\n",
+ "\n",
+ "root\n",
+ " |-- a: long (nullable = true)\n",
+ " |-- b: double (nullable = true)\n",
+ " |-- c: string (nullable = true)\n",
+ " |-- d: date (nullable = true)\n",
+ " |-- e: timestamp (nullable = true)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "# All DataFrames above produce the same result.\n",
+ "df.show()\n",
+ "df.printSchema()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Viewing Data\n",
+ "\n",
+ "The top rows of a DataFrame can be displayed using `DataFrame.show()`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+---+---+-------+----------+-------------------+\n",
+ "| a| b| c| d| e|\n",
+ "+---+---+-------+----------+-------------------+\n",
+ "| 1|2.0|string1|2000-01-01|2000-01-01 12:00:00|\n",
+ "+---+---+-------+----------+-------------------+\n",
+ "only showing top 1 row\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "df.show(1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Alternatively, you can enable the `spark.sql.repl.eagerEval.enabled`
configuration for eager evaluation of PySpark DataFrames in notebooks such
as Jupyter. The number of rows to show can be controlled via the
`spark.sql.repl.eagerEval.maxNumRows` configuration."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<table border='1'>\n",
+ "<tr><th>a</th><th>b</th><th>c</th><th>d</th><th>e</th></tr>\n",
+
"<tr><td>1</td><td>2.0</td><td>string1</td><td>2000-01-01</td><td>2000-01-01
12:00:00</td></tr>\n",
+
"<tr><td>2</td><td>3.0</td><td>string2</td><td>2000-02-01</td><td>2000-01-02
12:00:00</td></tr>\n",
+
"<tr><td>3</td><td>4.0</td><td>string3</td><td>2000-03-01</td><td>2000-01-03
12:00:00</td></tr>\n",
+ "</table>\n"
+ ],
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "spark.conf.set('spark.sql.repl.eagerEval.enabled', True)\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The rows can also be shown vertically. This is useful when rows are too
long to show horizontally."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-RECORD 0------------------\n",
+ " a | 1 \n",
+ " b | 2.0 \n",
+ " c | string1 \n",
+ " d | 2000-01-01 \n",
+ " e | 2000-01-01 12:00:00 \n",
+ "only showing top 1 row\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "df.show(1, vertical=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can see the DataFrame's schema and column names as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['a', 'b', 'c', 'd', 'e']"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "root\n",
+ " |-- a: long (nullable = true)\n",
+ " |-- b: double (nullable = true)\n",
+ " |-- c: string (nullable = true)\n",
+ " |-- d: date (nullable = true)\n",
+ " |-- e: timestamp (nullable = true)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "df.printSchema()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Show the summary of the DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
Review comment:
I believe this is being discussed in the dev mailing list. Let's discuss
there.