This is an automated email from the ASF dual-hosted git repository.

HyukjinKwon pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-4.x by this push:
     new 21ffb2e3af42 [SPARK-56816][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in memory for Pandas 3
21ffb2e3af42 is described below

commit 21ffb2e3af428f23754457313c153b53b9c1e70f
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Tue May 12 07:09:02 2026 +0900

    [SPARK-56816][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in memory for Pandas 3
    
    ### What changes were proposed in this pull request?
    
    In `PandasUDFInputTypeTests._compare_or_generate_golden`, after loading the golden CSV, replace `'object'` with `'str'` in the `Python Type` column for rows whose `Spark Type` is `string`, but only when running under Pandas `>= 3.0`. The golden file on disk is unchanged.
    
    ### Why are the changes needed?
    
    The daily-scheduled `Build / Python-only (master, Python 3.12, Pandas 3)` workflow is failing `test_pandas_input_type_coercion_vanilla`:
    
    ```
    line mismatch: expects ['string_values', 'string', "['abc', '', 'hello']", "['object', 'object', 'object']", ...] but got [..., "['str', 'str', 'str']", ...]
    line mismatch: expects ['string_null', 'string', "[None, 'test']", "['object', 'object']", ...] but got [..., "['str', 'str']", ...]
    ```
    
    Pandas 3.0 changed the default representation of string columns from numpy `object` dtype to the dedicated `str` dtype, so the pandas UDF that records `str(series.dtype)` now reports `'str'` for string inputs and the recorded `Python Type` column no longer matches the golden. The values themselves are unchanged.
    
    See https://github.com/apache/spark/actions/runs/25611987987/job/75184547983 for the failing run.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. Test-only change.
    
    ### How was this patch tested?
    
    Ran the full test module under a Pandas 3 conda env (Python 3.13.12, Pandas 3.0.2, PyArrow 23.0.1, NumPy 2.4.3):
    
    ```
    $ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_input_type"
    Tests passed in 17 seconds
    ```
    
    The Pandas `< 3.0` branch is a no-op, so existing behaviour is preserved.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Claude Code (claude-opus-4-7)
    
    Closes #55793 from zhengruifeng/pandas3-pandas-udf-input-type-coercion.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    (cherry picked from commit 5d210706d12fb681ec64d10e5006f94cf8db943c)
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py b/python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py
index 64377f2df698..a77a750e4684 100644
--- a/python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py
+++ b/python/pyspark/sql/tests/coercion/test_pandas_udf_input_type.py
@@ -251,6 +251,14 @@ class PandasUDFInputTypeTests(GoldenFileTestMixin, ReusedSQLTestCase):
         golden = None
         if not generating:
             golden = self.load_golden_csv(golden_csv)
+            # Pandas >= 3.0 reports the dedicated 'str' dtype for string columns,
+            # whereas earlier versions report 'object'. Patch the in-memory golden
+            # so the same file works under both versions.
+            if LooseVersion(pd.__version__) >= LooseVersion("3.0.0"):
+                str_rows = golden["Spark Type"] == "string"
+                golden.loc[str_rows, "Python Type"] = golden.loc[
+                    str_rows, "Python Type"
+                ].str.replace("'object'", "'str'")
 
         results = []
         for idx, (case_name, spark_type, data_func) in enumerate(self.test_cases):
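
The in-memory patch in the hunk above can be illustrated with a minimal, stdlib-only sketch. It is not the code from the commit, which uses pandas `.loc` and `.str.replace` gated on `LooseVersion(pd.__version__)`. The sample rows, the `Test Case` and `Spark Value` column names, the `binary_values` row, and the helpers `parse_version` and `load_golden` are all hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical stand-in for the golden CSV. Only the "Spark Type" and
# "Python Type" column names come from the commit; the rest is assumed.
GOLDEN_CSV = """Test Case,Spark Type,Spark Value,Python Type
string_values,string,"['abc', '', 'hello']","['object', 'object', 'object']"
string_null,string,"[None, 'test']","['object', 'object']"
binary_values,binary,"[b'abc']","['object']"
"""


def parse_version(version):
    """Return the leading (major, minor) numbers of a version string."""
    parts = version.split(".")
    return tuple(int(p) for p in parts[:2])


def load_golden(text, pandas_version):
    """Load golden rows and patch them in memory for Pandas >= 3.0.

    The source text is never modified, mirroring the commit's choice to
    leave the golden file on disk unchanged.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    if parse_version(pandas_version) >= (3, 0):
        for row in rows:
            # Only string columns changed their reported dtype.
            if row["Spark Type"] == "string":
                row["Python Type"] = row["Python Type"].replace(
                    "'object'", "'str'"
                )
    return rows


patched = load_golden(GOLDEN_CSV, "3.0.2")
unpatched = load_golden(GOLDEN_CSV, "2.2.3")
```

Restricting the replacement to rows whose `Spark Type` is `string` keeps other rows that legitimately report `'object'` (the hypothetical `binary_values` row here) untouched, and the older-version path returns the rows exactly as loaded.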

