Re: [PR] [SPARK-53287][PS] Add ANSI Migration Guide [spark]

via GitHub Mon, 18 Aug 2025 12:07:45 -0700


ueshin commented on code in PR #52034:
URL: https://github.com/apache/spark/pull/52034#discussion_r2283190670



##########
python/docs/source/user_guide/ansi_migration_guide.ipynb:
##########
@@ -0,0 +1,174 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4fa81d13",
+   "metadata": {},
+   "source": [
+    "# ANSI Migration Guide - Pandas API on Spark\n",
+    "ANSI mode is now on by default for Pandas API on Spark. This guide helps 
you understand the key behavior differences you’ll see.\n",
+    "In short, with ANSI mode on, Pandas API on Spark behavior matches native 
pandas in cases where Pandas API on Spark with ANSI off did not."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e1c7952",
+   "metadata": {},
+   "source": [
+    "## Behavior Change\n",
+    "### String Number Comparison\n",
+    "**ANSI off:** Spark implicitly casts numbers and strings, so `1` and 
`'1'` are considered equal.\n",
+    "\n",
+    "**ANSI on:** behaves like pandas, `1 == '1'` is False."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69474e28-c1cd-40fe-8ec6-7373b56c4dee",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> pdf = pd.DataFrame({\"int\": [1, 2], \"str\": [\"1\", \"2\"]})\n",
+    ">>> psdf = ps.from_pandas(pdf)\n",
+    ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    ">>> pdf[\"int\"] == pdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    "\n",
+    "# ANSI off\n",
+    ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n",
+    ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+    "0    True\n",
+    "1    True\n",
+    "dtype: bool\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90a4ea8d",
+   "metadata": {},
+   "source": [
+    "### Strict Casting\n",
+    "**ANSI off:** invalid casts (e.g., `'a' → int`) quietly became NULL.\n",
+    "\n",
+    "**ANSI on:** the same casts raise errors."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b361febc-4435-4bd1-9ee1-4874413d770c",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> pdf = pd.DataFrame({\"str\": [\"a\"]})\n",
+    ">>> psdf = ps.from_pandas(pdf)\n",
+    ">>> psdf[\"str\"].astype(int)\n",
+    "Traceback (most recent call last):\n",
+    "...\n",
+    "pyspark.errors.exceptions.captured.NumberFormatException: 
[CAST_INVALID_INPUT] ...\n",
+    ">>> pdf[\"str\"].astype(int)\n",
+    "Traceback (most recent call last):\n",
+    "...\n",
+    "ValueError: invalid literal for int() with base 10: 'a'\n",
+    "\n",
+    "# ANSI off\n",
+    ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n",
+    ">>> psdf[\"str\"].astype(int)\n",
+    "0   NaN\n",
+    "Name: str, dtype: float64\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e11583e2",
+   "metadata": {},
+   "source": [
+    "### MultiIndex.to_series Return\n",
+    "**ANSI off:** returns each row as a list ([1, red]).\n",

Review Comment:
   It doesn't return a list; it returns an array-type value.



##########
python/docs/source/user_guide/ansi_migration_guide.ipynb:
##########
@@ -0,0 +1,174 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4fa81d13",
+   "metadata": {},
+   "source": [
+    "# ANSI Migration Guide - Pandas API on Spark\n",
+    "ANSI mode is now on by default for Pandas API on Spark. This guide helps 
you understand the key behavior differences you’ll see.\n",
+    "In short, with ANSI mode on, Pandas API on Spark behavior matches native 
pandas in cases where Pandas API on Spark with ANSI off did not."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e1c7952",
+   "metadata": {},
+   "source": [
+    "## Behavior Change\n",
+    "### String Number Comparison\n",
+    "**ANSI off:** Spark implicitly casts numbers and strings, so `1` and 
`'1'` are considered equal.\n",
+    "\n",
+    "**ANSI on:** behaves like pandas, `1 == '1'` is False."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69474e28-c1cd-40fe-8ec6-7373b56c4dee",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> pdf = pd.DataFrame({\"int\": [1, 2], \"str\": [\"1\", \"2\"]})\n",
+    ">>> psdf = ps.from_pandas(pdf)\n",
+    ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    ">>> pdf[\"int\"] == pdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    "\n",
+    "# ANSI off\n",
+    ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n",
+    ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+    "0    True\n",
+    "1    True\n",
+    "dtype: bool\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90a4ea8d",
+   "metadata": {},
+   "source": [
+    "### Strict Casting\n",
+    "**ANSI off:** invalid casts (e.g., `'a' → int`) quietly became NULL.\n",
+    "\n",
+    "**ANSI on:** the same casts raise errors."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b361febc-4435-4bd1-9ee1-4874413d770c",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> pdf = pd.DataFrame({\"str\": [\"a\"]})\n",
+    ">>> psdf = ps.from_pandas(pdf)\n",
+    ">>> psdf[\"str\"].astype(int)\n",
+    "Traceback (most recent call last):\n",
+    "...\n",
+    "pyspark.errors.exceptions.captured.NumberFormatException: 
[CAST_INVALID_INPUT] ...\n",
+    ">>> pdf[\"str\"].astype(int)\n",
+    "Traceback (most recent call last):\n",
+    "...\n",
+    "ValueError: invalid literal for int() with base 10: 'a'\n",
+    "\n",
+    "# ANSI off\n",
+    ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n",
+    ">>> psdf[\"str\"].astype(int)\n",
+    "0   NaN\n",
+    "Name: str, dtype: float64\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e11583e2",
+   "metadata": {},
+   "source": [
+    "### MultiIndex.to_series Return\n",
+    "**ANSI off:** returns each row as a list ([1, red]).\n",
+    "\n",
+    "**ANSI on:** returns each row as a tuple ((1, red)), with the Runtime SQL 
Configuration `spark.sql.execution.pandas.structHandlingMode` set to `'row'`."

Review Comment:
   It doesn't return a tuple; it returns a struct-type value.
   The value becomes a tuple in the result of `to_pandas()` if the config is `"row"`; otherwise, the result depends on whether Arrow is used.



##########
python/docs/source/user_guide/ansi_migration_guide.ipynb:
##########
@@ -0,0 +1,174 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4fa81d13",
+   "metadata": {},
+   "source": [
+    "# ANSI Migration Guide - Pandas API on Spark\n",
+    "ANSI mode is now on by default for Pandas API on Spark. This guide helps 
you understand the key behavior differences you’ll see.\n",
+    "In short, with ANSI mode on, Pandas API on Spark behavior matches native 
pandas in cases where Pandas API on Spark with ANSI off did not."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e1c7952",
+   "metadata": {},
+   "source": [
+    "## Behavior Change\n",
+    "### String Number Comparison\n",
+    "**ANSI off:** Spark implicitly casts numbers and strings, so `1` and 
`'1'` are considered equal.\n",
+    "\n",
+    "**ANSI on:** behaves like pandas, `1 == '1'` is False."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69474e28-c1cd-40fe-8ec6-7373b56c4dee",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> pdf = pd.DataFrame({\"int\": [1, 2], \"str\": [\"1\", \"2\"]})\n",
+    ">>> psdf = ps.from_pandas(pdf)\n",
+    ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    ">>> pdf[\"int\"] == pdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    "\n",
+    "# ANSI off\n",

Review Comment:
   Add a comment like this for the ANSI-on case as well.



##########
python/docs/source/user_guide/ansi_migration_guide.ipynb:
##########
@@ -0,0 +1,174 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4fa81d13",
+   "metadata": {},
+   "source": [
+    "# ANSI Migration Guide - Pandas API on Spark\n",
+    "ANSI mode is now on by default for Pandas API on Spark. This guide helps 
you understand the key behavior differences you’ll see.\n",
+    "In short, with ANSI mode on, Pandas API on Spark behavior matches native 
pandas in cases where Pandas API on Spark with ANSI off did not."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e1c7952",
+   "metadata": {},
+   "source": [
+    "## Behavior Change\n",
+    "### String Number Comparison\n",
+    "**ANSI off:** Spark implicitly casts numbers and strings, so `1` and 
`'1'` are considered equal.\n",
+    "\n",
+    "**ANSI on:** behaves like pandas, `1 == '1'` is False."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69474e28-c1cd-40fe-8ec6-7373b56c4dee",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> pdf = pd.DataFrame({\"int\": [1, 2], \"str\": [\"1\", \"2\"]})\n",
+    ">>> psdf = ps.from_pandas(pdf)\n",
+    ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    ">>> pdf[\"int\"] == pdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    "\n",
+    "# ANSI off\n",
+    ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n",
+    ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+    "0    True\n",
+    "1    True\n",
+    "dtype: bool\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90a4ea8d",
+   "metadata": {},
+   "source": [
+    "### Strict Casting\n",
+    "**ANSI off:** invalid casts (e.g., `'a' → int`) quietly became NULL.\n",
+    "\n",
+    "**ANSI on:** the same casts raise errors."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b361febc-4435-4bd1-9ee1-4874413d770c",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> pdf = pd.DataFrame({\"str\": [\"a\"]})\n",
+    ">>> psdf = ps.from_pandas(pdf)\n",
+    ">>> psdf[\"str\"].astype(int)\n",
+    "Traceback (most recent call last):\n",
+    "...\n",
+    "pyspark.errors.exceptions.captured.NumberFormatException: 
[CAST_INVALID_INPUT] ...\n",
+    ">>> pdf[\"str\"].astype(int)\n",
+    "Traceback (most recent call last):\n",
+    "...\n",
+    "ValueError: invalid literal for int() with base 10: 'a'\n",
+    "\n",
+    "# ANSI off\n",
+    ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n",
+    ">>> psdf[\"str\"].astype(int)\n",
+    "0   NaN\n",
+    "Name: str, dtype: float64\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e11583e2",
+   "metadata": {},
+   "source": [
+    "### MultiIndex.to_series Return\n",
+    "**ANSI off:** returns each row as a list ([1, red]).\n",
+    "\n",
+    "**ANSI on:** returns each row as a tuple ((1, red)), with the Runtime SQL 
Configuration `spark.sql.execution.pandas.structHandlingMode` set to `'row'`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4671a895-ed40-4bc4-b1bc-fa9fbb86cc18",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> arrays = [[1,  2], [\"red\", \"blue\"]]\n",
+    ">>> pidx = pd.MultiIndex.from_arrays(arrays, names=(\"number\", 
\"color\"))\n",
+    ">>> psidx = ps.from_pandas(pidx)\n",
+    "\n",
+    ">>> spark.conf.set(\"spark.sql.execution.pandas.structHandlingMode\", 
\"row\")\n",
+    ">>> psidx.to_series()\n",
+    "number  color\n",
+    "1       red       (1, red)\n",
+    "2       blue     (2, blue)\n",
+    "dtype: object\n",
+    ">>> pidx.to_series()\n",
+    "number  color\n",
+    "1       red       (1, red)\n",
+    "2       blue     (2, blue)\n",
+    "dtype: object\n",
+    "\n",
+    "# ANSI off\n",
+    ">>> psidx.to_series()\n",
+    "number  color\n",
+    "1       red       [1, red]\n",
+    "2       blue     [2, blue]\n",
+    "dtype: object\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe146afd",
+   "metadata": {},
+   "source": [
+    "## Related Configurations\n",
+    "1. **`compute.fail_on_ansi_mode` (Pandas API on Spark option)**\n",
+    "   - Controls whether Pandas API on Spark fails immediately when ANSI 
mode is enabled.\n",
+    "   - Now overridden by `compute.ansi_mode_support`.\n",
+    "\n",
+    "2. **`compute.ansi_mode_support` (Pandas API on Spark option)**\n",
+    "   - Indicates whether ANSI mode is fully supported.\n",
+    "\n",
+    "3. **`spark.sql.ansi.enabled` (Spark config)**\n",
+    "   - Native Spark setting that controls ANSI mode."

Review Comment:
   Let's reorder the configs:
   1. `spark.sql.ansi.enabled`
   
   This is always the most powerful config; it controls the behavior of both Spark SQL and the pandas API.
   If users want the old behavior everywhere, this should be `False`, in which case the other configs have no effect.
   
   2. `compute.ansi_mode_support`
   
   Mention that this is effective only when ANSI is enabled.
   
   3. `compute.fail_on_ansi_mode`
   
   Mention that this is effective only when ANSI is enabled and `compute.ansi_mode_support` is `False`.
   Setting this to `False` makes the pandas API work with the old behavior even when ANSI is enabled.
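
To make the precedence concrete, here is a minimal pure-Python sketch of how the three configs interact as described in this comment. It is only a model of that description, not the actual pyspark implementation: the function name `resolve_ansi_behavior` and the labels `"legacy"`, `"ansi"`, and `"error"` are hypothetical and were introduced for illustration.

```python
def resolve_ansi_behavior(
    ansi_enabled: bool,
    ansi_mode_support: bool,
    fail_on_ansi_mode: bool,
) -> str:
    """Hypothetical model of the config precedence described in the review.

    Returns one of three illustrative labels (not real pyspark values):
      "legacy" - the old (non-ANSI) pandas API on Spark behavior
      "ansi"   - the ANSI-aware behavior
      "error"  - pandas API on Spark fails because ANSI mode is on
    """
    # 1. spark.sql.ansi.enabled is checked first; when it is False,
    #    the two compute.* options are not effective.
    if not ansi_enabled:
        return "legacy"
    # 2. compute.ansi_mode_support only matters when ANSI is enabled.
    if ansi_mode_support:
        return "ansi"
    # 3. compute.fail_on_ansi_mode only matters when ANSI is enabled
    #    and compute.ansi_mode_support is False.
    if fail_on_ansi_mode:
        return "error"
    # ANSI enabled, support off, fail off: old behavior under ANSI.
    return "legacy"
```

Documenting the configs in this order mirrors the order of the checks: each later option is only consulted when the earlier ones leave the outcome undecided.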



##########
python/docs/source/user_guide/ansi_migration_guide.ipynb:
##########
@@ -0,0 +1,174 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4fa81d13",
+   "metadata": {},
+   "source": [
+    "# ANSI Migration Guide - Pandas API on Spark\n",
+    "ANSI mode is now on by default for Pandas API on Spark. This guide helps 
you understand the key behavior differences you’ll see.\n",
+    "In short, with ANSI mode on, Pandas API on Spark behavior matches native 
pandas in cases where Pandas API on Spark with ANSI off did not."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e1c7952",
+   "metadata": {},
+   "source": [
+    "## Behavior Change\n",
+    "### String Number Comparison\n",
+    "**ANSI off:** Spark implicitly casts numbers and strings, so `1` and 
`'1'` are considered equal.\n",
+    "\n",
+    "**ANSI on:** behaves like pandas, `1 == '1'` is False."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69474e28-c1cd-40fe-8ec6-7373b56c4dee",
+   "metadata": {},
+   "source": [
+    "Examples are as shown below:\n",
+    "\n",
+    "```python\n",
+    ">>> pdf = pd.DataFrame({\"int\": [1, 2], \"str\": [\"1\", \"2\"]})\n",
+    ">>> psdf = ps.from_pandas(pdf)\n",
+    ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",
+    ">>> pdf[\"int\"] == pdf[\"str\"]\n",
+    "0    False\n",
+    "1    False\n",
+    "dtype: bool\n",

Review Comment:
   Let's show the pandas result either first or last. Mixing them is confusing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


