HyukjinKwon commented on code in PR #42284:
URL: https://github.com/apache/spark/pull/42284#discussion_r1284959852
##########
python/docs/source/getting_started/testing_pyspark.ipynb:
##########
@@ -0,0 +1,486 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "4ee2125b-f889-47e6-9c3d-8bd63a253683",
+ "metadata": {},
+ "source": [
+ "# Testing PySpark\n",
+ "\n",
+ "This guide is a reference for writing robust tests for PySpark code.\n",
+ "\n",
+ "To view the docs for PySpark test utils, see here. To see the code for PySpark built-in test utils, check out the Spark repository here. To see the JIRA board tickets for the PySpark test framework, see here."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0e8ee4b6-9544-45e1-8a91-e71ed8ef8b9d",
+ "metadata": {},
+ "source": [
+ "## Build a PySpark Application\n",
+ "Here is an example of how to start a PySpark application. Feel free to skip to the next section, “Testing your PySpark Application,” if you already have an application you’re ready to test.\n",
+ "\n",
+ "First, start your Spark Session."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "9af4a35b-17e8-4e45-816b-34c14c5902f7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pyspark.sql import SparkSession\n",
+ "from pyspark.sql.functions import col\n",
+ "\n",
+ "# Create a SparkSession\n",
+ "spark = SparkSession.builder.appName(\"Testing PySpark Example\").getOrCreate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4a4c6efe-91f5-4e18-b4b2-b0401c2368e4",
+ "metadata": {},
+ "source": [
+ "Next, create a DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "3b483dd8-3a76-41c6-9206-301d7ef314d6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sample_data = [{\"name\": \"John D.\", \"age\": 30},\n",
+ "               {\"name\": \"Alice G.\", \"age\": 25},\n",
+ "               {\"name\": \"Bob T.\", \"age\": 35},\n",
+ "               {\"name\": \"Eve A.\", \"age\": 28}]\n",
+ "\n",
+ "df = spark.createDataFrame(sample_data)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0f44333-0e08-470b-9fa2-38f59e3dbd63",
+ "metadata": {},
+ "source": [
+ "Now, let’s define and apply a transformation function to our DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "a6c0b766-af5f-4e1d-acf8-887d7cf0b0b2",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+---+--------+\n",
+ "|age| name|\n",
+ "+---+--------+\n",
+ "| 30| John D.|\n",
+ "| 25|Alice G.|\n",
+ "| 35| Bob T.|\n",
+ "| 28| Eve A.|\n",
+ "+---+--------+\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pyspark.sql.functions import col, regexp_replace\n",
+ "\n",
+ "# Remove additional spaces in name\n",
+ "def remove_extra_spaces(df, column_name):\n",
+ "    # Remove extra spaces from the specified column\n",
+ "    df_transformed = df.withColumn(column_name, regexp_replace(col(column_name), \"\\\\s+\", \" \"))\n",
+ "\n",
+ "    return df_transformed\n",
+ "\n",
+ "transformed_df = remove_extra_spaces(df, \"name\")\n",
+ "\n",
+ "transformed_df.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "530beaa6-aabf-43a1-ad2b-361f267e9608",
+ "metadata": {},
+ "source": [
+ "## Testing your PySpark Application\n",
+ "Now let’s test our PySpark transformation function. \n",
+ "\n",
+ "One option is to simply eyeball the resulting DataFrame. However, this can be impractical for large DataFrames or input sizes.\n",
+ "\n",
+ "A better way is to write tests. Here are some examples of how we can test our code. The examples below apply to Spark 3.5 and above.\n",
+ "\n",
+ "Note that these examples are not exhaustive; there are many other test framework alternatives you can use instead of `unittest` or `pytest`. The built-in PySpark testing util functions are standalone, meaning they are compatible with any test framework or CI test pipeline.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d84a9fc1-9768-4af4-bfbf-e832f23334dc",
+ "metadata": {},
+ "source": [
+ "### Option 1: Using Only PySpark Built-in Test Utility Functions\n",
+ "\n",
+ "For simple ad-hoc validation cases, PySpark testing utils like `assertDataFrameEqual` and `assertSchemaEqual` can be used in a standalone context.\n",
+ "You can easily test PySpark code in a notebook session. For example, say you want to assert equality between two DataFrames:\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "8e533732-ee40-4cd0-9669-8eb92973908a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pyspark.testing\n",
+ "\n",
Review Comment:
remove newline
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]