(spark) branch master updated: [SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3

ruifengz Tue, 19 May 2026 19:15:01 -0700

This is an automated email from the ASF dual-hosted git repository.

zhengruifeng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new c57777e9c867 [SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3
c57777e9c867 is described below

commit c57777e9c867d1d87f42328d28281013f39f19f0
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Wed May 20 10:14:42 2026 +0800

    [SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3
    
    ### What changes were proposed in this pull request?
    
    Make `pyspark.sql.tests.coercion.test_pandas_udf_return_type.PandasUDFReturnTypeTests` work under pandas >= 3.0 and on systems whose `tzdata` package no longer ships the legacy `US/*` aliases (e.g. Ubuntu 24.04 / noble).
    
    1. **Switch the tz-aware fixture from `US/Eastern` to `America/New_York`.** The values returned by `pd.date_range(...).values` are identical for the two aliases (same zone, same DST rules), so the on-disk golden file does not need to be regenerated.
    
    2. **Patch the loaded golden DataFrame in memory for pandas >= 3.0.** The golden file was generated under pandas 2 and the on-disk content is unchanged. At load time, when running under pandas >= 3.0, the test:
       - Renames column keys whose representation differs between the two versions: datetime64 ndarrays default to `[us]` instead of `[ns]`, and `pd.Categorical` keeps `str`-dtyped categories instead of `object`.
       - Scales 13+ digit integers in cells of datetime64 / Timedelta-list columns by 1/1000. Pandas 3 returns microseconds where pandas 2 returned nanoseconds for the same cast (e.g. `bigint <- pd.date_range(...).values` flips from `86_400_000_000_000` to `86_400_000_000`).
       - Overrides the single `decimal(10,0) x ['12','34']list` cell, which flipped from `X` (pandas 2 errored) to `[Decimal('12'), Decimal('34')]` (pandas 3 succeeds at the string -> Decimal coercion).
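
As a rough illustration of the scaling step described above, the following standalone sketch applies the same regex substitution to plain strings. The helper name `scale_ns_to_us` and the sample cell strings are hypothetical; only the pattern and the integer division by 1000 mirror the patch.

```python
import re

# One day expressed in the two units discussed above.
NS_PER_DAY = 86_400 * 10**9  # nanoseconds (pandas 2 result for the cast)
US_PER_DAY = 86_400 * 10**6  # microseconds (pandas 3 result for the cast)


def scale_ns_to_us(cell: str) -> str:
    """Divide every run of 13 or more digits in `cell` by 1000.

    Hypothetical helper: it assumes any integer that long in the cell is a
    nanosecond count, which holds for these fixtures but not in general.
    """
    return re.sub(r"\d{13,}", lambda m: str(int(m.group()) // 1000), cell)


if __name__ == "__main__":
    # The epoch value 0 is too short to match and is left as-is.
    print(scale_ns_to_us(f"[0, {NS_PER_DAY}]"))
```

The 13-digit threshold leaves short integers and the `X` error marker untouched, which is why the substitution can be applied to a whole column without inspecting each cell first.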
    
    ### Why are the changes needed?
    
    The scheduled CI run on the `python-312-pandas-3` image fails in this suite, e.g. https://github.com/apache/spark/actions/runs/26002965955/job/76430490989. Root causes:
    
    - `pd.date_range("19700101", periods=2, tz="US/Eastern").values` raises `zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key US/Eastern'`. Pandas 3 dropped `pytz` as a hard dependency and now resolves tz names through stdlib `zoneinfo`, which on Ubuntu 24.04 cannot find `US/Eastern` because Ubuntu moved the legacy aliases out of `tzdata` into a separate `tzdata-legacy` package that the CI image does not install.
    - After the alias fix, `golden.loc[str_t, str_v]` raises `KeyError` because the column keys in the golden file are pandas-2-shaped (`datetime64[ns]`, `Categorical(..., object)`) but the lookup keys built at runtime are pandas-3-shaped (`datetime64[us]`, `Categorical(..., str)`).
    - After the key rename, assertions still fail because the cast result values themselves changed: nanoseconds -> microseconds for datetime / Timedelta inputs, and one cell where pandas 3 now succeeds where pandas 2 errored.
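
The first root cause can be checked on a given machine with a small probe like the one below. It is a sketch, not part of the patch: `zone_available` is a hypothetical helper, and the result for any real key depends on which tz database the host has installed.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def zone_available(key: str) -> bool:
    """Return True if stdlib zoneinfo can resolve `key` on this host."""
    try:
        ZoneInfo(key)
    except ZoneInfoNotFoundError:
        return False
    return True


if __name__ == "__main__":
    # Compare the canonical name with the legacy alias; no output is
    # asserted here because it varies with the installed tz database.
    for key in ("America/New_York", "US/Eastern"):
        print(key, zone_available(key))
```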
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. Test-only change.
    
    ### How was this patch tested?
    
    Ran the suite locally under two envs:
    
    ```
    # pandas 2.3.3 / Python 3.13
    $ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
    ...
    Tests passed in 31 seconds
    
    # pandas 3.0.2 / Python 3.13
    $ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
    ...
    Tests passed in 30 seconds
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Claude Code
    
    Closes #55974 from zhengruifeng/SPARK-fix-tz-uneastern.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 .../tests/coercion/test_pandas_udf_return_type.py  | 45 +++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/coercion/test_pandas_udf_return_type.py b/python/pyspark/sql/tests/coercion/test_pandas_udf_return_type.py
index 454fe726f95c..f1ba3cd84723 100644
--- a/python/pyspark/sql/tests/coercion/test_pandas_udf_return_type.py
+++ b/python/pyspark/sql/tests/coercion/test_pandas_udf_return_type.py
@@ -104,7 +104,7 @@ class PandasUDFReturnTypeTests(GoldenFileTestMixin, ReusedSQLTestCase):
             np.arange(1, 3).astype("complex128"),
             [np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], dtype=np.int32)],
             pd.date_range("19700101", periods=2).values,
-            pd.date_range("19700101", periods=2, tz="US/Eastern").values,
+            pd.date_range("19700101", periods=2, tz="America/New_York").values,
             [pd.Timedelta("1 day"), pd.Timedelta("2 days")],
             pd.Categorical(["A", "B"]),
             pd.DataFrame({"_1": [1, 2]}),
@@ -160,6 +160,49 @@ class PandasUDFReturnTypeTests(GoldenFileTestMixin, ReusedSQLTestCase):
         golden = None
         if not generating:
             golden = self.load_golden_csv(golden_csv)
+            # The golden file was generated under pandas 2; patch the loaded
+            # copy in memory so the same file works under pandas >= 3.0, where
+            # the defaults differ: datetime64 ndarrays use [us] instead of [ns],
+            # Categorical categories use str instead of object, and the same
+            # casts return microseconds instead of nanoseconds.
+            if LooseVersion(pd.__version__) >= LooseVersion("3.0.0"):
+                rename = {}
+                scale_cols = []
+                for value in self.test_data:
+                    new_key = self.repr_value(value)
+                    if isinstance(value, np.ndarray) and value.dtype.kind == "M":
+                        old_key = self.repr_value(value.astype("datetime64[ns]"))
+                        if old_key != new_key:
+                            rename[old_key] = new_key
+                        scale_cols.append(new_key)
+                    elif isinstance(value, pd.Categorical) and value.categories.dtype != object:
+                        old_key = self.repr_value(
+                            pd.Categorical(
+                                value.tolist(),
+                                categories=pd.Index(value.categories.tolist(), dtype=object),
+                            )
+                        )
+                        if old_key != new_key:
+                            rename[old_key] = new_key
+                    elif isinstance(value, list) and value and isinstance(value[0], pd.Timedelta):
+                        scale_cols.append(new_key)
+
+                if rename:
+                    golden.rename(columns=rename, inplace=True)
+
+                for col in scale_cols:
+                    golden[col] = golden[col].str.replace(
+                        r"\d{13,}",
+                        lambda m: str(int(m.group()) // 1000),
+                        regex=True,
+                    )
+
+                # Pandas 3 succeeds at coercing string list -> Decimal where
+                # pandas 2 errored, so the corresponding cell flips from "X".
+                decimal_idx = self.repr_type(DecimalType(10, 0))
+                decimal_col = self.repr_value(["12", "34"])
+                if decimal_idx in golden.index and decimal_col in golden.columns:
+                    golden.loc[decimal_idx, decimal_col] = "[Decimal('12'), Decimal('34')]"
 
         def work(arg):
             spark_type, value = arg

