This is an automated email from the ASF dual-hosted git repository.
HyukjinKwon pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.x by this push:
new 21ffb2e3af42 [SPARK-56816][PYTHON][TESTS] Patch pandas UDF input-type
coercion golden in memory for Pandas 3
21ffb2e3af42 is described below
commit 21ffb2e3af428f23754457313c153b53b9c1e70f
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Tue May 12 07:09:02 2026 +0900
[SPARK-56816][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in
memory for Pandas 3
### What changes were proposed in this pull request?
In `PandasUDFInputTypeTests._compare_or_generate_golden`, after loading the
golden CSV, replace `'object'` with `'str'` in the `Python Type` column for
rows whose `Spark Type` is `string`, but only when running under Pandas
`>= 3.0`. The golden file on disk is unchanged.
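As an illustration only, the in-memory patch described above can be sketched on a toy DataFrame. The helper name `patch_golden_for_pandas3`, the `Test Case` column name, the `binary` row, and the plain major-version check are assumptions made for this sketch; the actual test uses `LooseVersion` and the golden loaded by `load_golden_csv`.

```python
import pandas as pd


def patch_golden_for_pandas3(golden: pd.DataFrame, pandas_version: str) -> pd.DataFrame:
    """Return a patched copy of ``golden`` (hypothetical helper, not part of Spark)."""
    patched = golden.copy()
    # Simplified version gate: compare only the major version component.
    major = int(pandas_version.split(".")[0])
    if major >= 3:
        # Only rows whose Spark Type is "string" are rewritten.
        str_rows = patched["Spark Type"] == "string"
        patched.loc[str_rows, "Python Type"] = patched.loc[
            str_rows, "Python Type"
        ].str.replace("'object'", "'str'", regex=False)
    return patched


# Toy golden data; the "binary" row is made up to show that non-string rows
# keeping an 'object' dtype are left alone.
golden = pd.DataFrame(
    {
        "Test Case": ["string_values", "binary_values"],
        "Spark Type": ["string", "binary"],
        "Python Type": ["['object', 'object', 'object']", "['object', 'object']"],
    }
)

patched = patch_golden_for_pandas3(golden, "3.0.2")
unchanged = patch_golden_for_pandas3(golden, "2.2.3")
print(patched["Python Type"].tolist())
print(unchanged["Python Type"].tolist())
```

Restricting the replacement to `string` rows keeps other types that legitimately report `'object'` untouched, and working on a copy mirrors the fact that the golden file on disk is never rewritten.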
### Why are the changes needed?
The daily-scheduled `Build / Python-only (master, Python 3.12, Pandas 3)`
workflow is failing `test_pandas_input_type_coercion_vanilla`:
```
line mismatch: expects ['string_values', 'string', "['abc', '', 'hello']", "['object', 'object', 'object']", ...] but got [..., "['str', 'str', 'str']", ...]
line mismatch: expects ['string_null', 'string', "[None, 'test']", "['object', 'object']", ...] but got [..., "['str', 'str']", ...]
```
Pandas 3.0 changed the default representation of string columns from numpy
`object` dtype to the dedicated `str` dtype, so the pandas UDF that records
`str(series.dtype)` now reports `'str'` for string inputs and the recorded
`Python Type` column no longer matches the golden. The values themselves are
unchanged.
See
https://github.com/apache/spark/actions/runs/25611987987/job/75184547983 for
the failing run.
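A minimal way to observe the dtype difference locally, assuming default pandas options (in particular, string inference not manually enabled on Pandas 2.x), might look like this. The variable names are illustrative and are not taken from the test.

```python
import pandas as pd

# Build a string column the same way under either pandas version.
series = pd.Series(["abc", "", "hello"])

# The UDF under test records str(series.dtype); this is the value that changed.
recorded = str(series.dtype)

# Assumed expectation per the description above: 'str' on Pandas >= 3.0,
# 'object' on earlier versions with default options.
pandas_major = int(pd.__version__.split(".")[0])
expected = "str" if pandas_major >= 3 else "object"

print(pd.__version__, recorded, expected)
print(series.tolist())  # the values themselves are the same in both cases
```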
### Does this PR introduce _any_ user-facing change?
No. Test-only change.
### How was this patch tested?
Ran the full test module under a Pandas 3 conda env (Python 3.13.12, Pandas
3.0.2, PyArrow 23.0.1, NumPy 2.4.3):
```
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_input_type"
Tests passed in 17 seconds
```
The Pandas `< 3.0` branch is a no-op, so existing behaviour is preserved.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-7)
Closes #55793 from zhengruifeng/pandas3-pandas-udf-input-type-coercion.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 5d210706d12fb681ec64d10e5006f94cf8db943c)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py b/python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py
index 64377f2df698..a77a750e4684 100644
--- a/python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py
+++ b/python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py
@@ -251,6 +251,14 @@ class PandasUDFInputTypeTests(GoldenFileTestMixin, ReusedSQLTestCase):
         golden = None
         if not generating:
             golden = self.load_golden_csv(golden_csv)
+            # Pandas >= 3.0 reports the dedicated 'str' dtype for string columns,
+            # whereas earlier versions report 'object'. Patch the in-memory golden
+            # so the same file works under both versions.
+            if LooseVersion(pd.__version__) >= LooseVersion("3.0.0"):
+                str_rows = golden["Spark Type"] == "string"
+                golden.loc[str_rows, "Python Type"] = golden.loc[
+                    str_rows, "Python Type"
+                ].str.replace("'object'", "'str'")
 
         results = []
         for idx, (case_name, spark_type, data_func) in enumerate(self.test_cases):