xinrong-meng commented on code in PR #52034:
URL: https://github.com/apache/spark/pull/52034#discussion_r2286076565
##########
python/docs/source/user_guide/ansi_migration_guide.ipynb:
##########
@@ -0,0 +1,174 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "4fa81d13",
+ "metadata": {},
+ "source": [
+ "# ANSI Migration Guide - Pandas API on Spark\n",
+ "ANSI mode is now on by default for Pandas API on Spark. This guide helps you understand the key behavior differences you’ll see.\n",
+ "In short, with ANSI mode on, Pandas API on Spark behavior matches native pandas in cases where Pandas API on Spark with ANSI off did not."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6e1c7952",
+ "metadata": {},
+ "source": [
+ "## Behavior Change\n",
+ "### String Number Comparison\n",
+ "**ANSI off:** Spark implicitly casts numbers and strings, so `1` and `'1'` are considered equal.\n",
+ "\n",
+ "**ANSI on:** behaves like pandas: `1 == '1'` is False."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "69474e28-c1cd-40fe-8ec6-7373b56c4dee",
+ "metadata": {},
+ "source": [
+ "Examples are as shown below:\n",
+ "\n",
+ "```python\n",
+ ">>> pdf = pd.DataFrame({\"int\": [1, 2], \"str\": [\"1\", \"2\"]})\n",
+ ">>> psdf = ps.from_pandas(pdf)\n",
+ ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+ "0 False\n",
+ "1 False\n",
+ "dtype: bool\n",
+ ">>> pdf[\"int\"] == pdf[\"str\"]\n",
+ "0 False\n",
+ "1 False\n",
+ "dtype: bool\n",
+ "\n",
+ "# ANSI off\n",
+ ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n",
+ ">>> psdf[\"int\"] == psdf[\"str\"]\n",
+ "0 True\n",
+ "1 True\n",
+ "dtype: bool\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "90a4ea8d",
+ "metadata": {},
+ "source": [
+ "### Strict Casting\n",
+ "**ANSI off:** invalid casts (e.g., `'a' → int`) quietly become NULL.\n",
+ "\n",
+ "**ANSI on:** the same casts raise errors."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b361febc-4435-4bd1-9ee1-4874413d770c",
+ "metadata": {},
+ "source": [
+ "Examples are as shown below:\n",
+ "\n",
+ "```python\n",
+ ">>> pdf = pd.DataFrame({\"str\": [\"a\"]})\n",
+ ">>> psdf = ps.from_pandas(pdf)\n",
+ ">>> psdf[\"str\"].astype(int)\n",
+ "Traceback (most recent call last):\n",
+ "...\n",
+ "pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] ...\n",
+ ">>> pdf[\"str\"].astype(int)\n",
+ "Traceback (most recent call last):\n",
+ "...\n",
+ "ValueError: invalid literal for int() with base 10: 'a'\n",
+ "\n",
+ "# ANSI off\n",
+ ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n",
+ ">>> psdf[\"str\"].astype(int)\n",
+ "0 NaN\n",
+ "Name: str, dtype: float64\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e11583e2",
+ "metadata": {},
+ "source": [
+ "### MultiIndex.to_series Return\n",
+ "**ANSI off:** returns each row as a list ([1, red]).\n",
Review Comment:
Adjusted.
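
The pandas half of the two quoted examples can be checked without a Spark session. The following is a minimal sketch, assuming only that pandas is installed; the variable names `eq` and `cast_error` are illustrative and do not appear in the notebook. It reproduces the native-pandas results that the guide says Pandas API on Spark matches when ANSI mode is on.

```python
import pandas as pd

# String/number comparison: pandas compares element-wise without
# implicitly casting, so 1 and "1" are not equal.
pdf = pd.DataFrame({"int": [1, 2], "str": ["1", "2"]})
eq = pdf["int"] == pdf["str"]
print(eq.tolist())

# Strict casting: pandas raises on an invalid cast instead of
# producing a null value.
cast_error = None
try:
    pd.DataFrame({"str": ["a"]})["str"].astype(int)
except ValueError as exc:
    cast_error = exc
print(type(cast_error).__name__)
```

The Spark side (`ps.from_pandas(...)` with `spark.sql.ansi.enabled` toggled) is left out because it needs a running Spark session.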