This is an automated email from the ASF dual-hosted git repository.
zhengruifeng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new c57777e9c867 [SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3
c57777e9c867 is described below
commit c57777e9c867d1d87f42328d28281013f39f19f0
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Wed May 20 10:14:42 2026 +0800
[SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3
### What changes were proposed in this pull request?
Make
`pyspark.sql.tests.coercion.test_pandas_udf_return_type.PandasUDFReturnTypeTests`
work under pandas >= 3.0 and on systems whose `tzdata` package no longer ships
the legacy `US/*` aliases (e.g. Ubuntu 24.04 / noble).
1. **Switch the tz-aware fixture from `US/Eastern` to `America/New_York`.**
The values returned by `pd.date_range(...).values` are identical for the two
aliases (same zone, same DST rules), so the on-disk golden file does not need
to be regenerated.
2. **Patch the loaded golden DataFrame in memory for pandas >= 3.0.** The
golden file was generated under pandas 2 and the on-disk content is unchanged.
At load time, when running under pandas >= 3.0, the test:
- Renames column keys whose representation differs between the two
versions: datetime64 ndarrays default to `[us]` instead of `[ns]`, and
`pd.Categorical` keeps `str`-dtyped categories instead of `object`.
- Divides every integer of 13 or more digits in cells of datetime64 /
Timedelta-list columns by 1000 (integer division). Pandas 3 returns microseconds
where pandas 2 returned nanoseconds for the same cast (e.g.
`bigint <- pd.date_range(...).values` flips from `86_400_000_000_000` to
`86_400_000_000`).
- Overrides the single cell at row `decimal(10,0)` and column `['12','34']`
(a list of strings), which flipped from `X` (pandas 2 errored) to
`[Decimal('12'), Decimal('34')]` (pandas 3 succeeds at the string -> Decimal
coercion).
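The rename-then-rescale step described above can be illustrated without pandas. The following is a minimal sketch, not the test's actual code: `scale_ns_to_us` and `patch_row` are hypothetical helper names, and the row is modelled as a plain `dict` of column key to cell string rather than a golden DataFrame.

```python
import re

# A run of 13 or more digits is treated as a nanosecond-scale integer,
# mirroring the r"\d{13,}" pattern used in the patch.
_NS_INT = re.compile(r"\d{13,}")


def scale_ns_to_us(cell: str) -> str:
    """Integer-divide every 13+ digit integer in `cell` by 1000 (ns -> us)."""
    return _NS_INT.sub(lambda m: str(int(m.group()) // 1000), cell)


def patch_row(row: dict, rename: dict, scale_cols: set) -> dict:
    """Rename old-style column keys, then rescale the listed columns.

    `rename` maps old key -> new key; `scale_cols` holds new-style keys.
    Both are hypothetical stand-ins for the structures built in the test.
    """
    patched = {}
    for old_key, cell in row.items():
        new_key = rename.get(old_key, old_key)
        patched[new_key] = scale_ns_to_us(cell) if new_key in scale_cols else cell
    return patched


# Placeholder keys; the real keys come from the test's repr_value helper.
row = {"datetime_ns_key": "[0, 86400000000000]", "int_key": "[1, 2]"}
print(patch_row(row, {"datetime_ns_key": "datetime_us_key"}, {"datetime_us_key"}))
```

Short integers and non-numeric cells such as `X` pass through unchanged because the pattern only matches runs of at least 13 digits.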
### Why are the changes needed?
The scheduled CI run on the `python-312-pandas-3` image fails in this
suite, e.g.
https://github.com/apache/spark/actions/runs/26002965955/job/76430490989. Root
causes:
- `pd.date_range("19700101", periods=2, tz="US/Eastern").values` raises
`zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key
US/Eastern'`. Pandas 3 dropped `pytz` as a hard dependency and now resolves tz
names through stdlib `zoneinfo`, which on Ubuntu 24.04 cannot find `US/Eastern`
because Ubuntu moved the legacy aliases out of `tzdata` into a separate
`tzdata-legacy` package that the CI image does not install.
- After the alias fix, `golden.loc[str_t, str_v]` raises `KeyError` because
the column keys in the golden file are pandas-2-shaped (`datetime64[ns]`,
`Categorical(..., object)`) but the lookup keys built at runtime are
pandas-3-shaped (`datetime64[us]`, `Categorical(..., str)`).
- After the key rename, assertions still fail because the cast result
values themselves changed: nanoseconds -> microseconds for datetime / Timedelta
inputs, and one cell where pandas 3 now succeeds where pandas 2 errored.
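To check the first root cause on a given machine, a small stdlib-only probe can report whether a tz key resolves through `zoneinfo` and, when both keys are present, whether two keys yield the same UTC offsets. This is a sketch under assumptions: `tz_key_resolves`, `same_utc_offsets`, and `SAMPLES` are hypothetical names, and the printed results depend on the locally installed tz database.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def tz_key_resolves(key: str) -> bool:
    """Return True if stdlib zoneinfo can load `key` on this machine."""
    try:
        ZoneInfo(key)
    except ZoneInfoNotFoundError:
        return False
    return True


def same_utc_offsets(key_a, key_b, instants):
    """Compare two tz keys at the given timezone-aware datetimes.

    Returns None when either key is missing from the local tz database.
    """
    if not (tz_key_resolves(key_a) and tz_key_resolves(key_b)):
        return None
    tz_a, tz_b = ZoneInfo(key_a), ZoneInfo(key_b)
    return all(
        t.astimezone(tz_a).utcoffset() == t.astimezone(tz_b).utcoffset()
        for t in instants
    )


# Hypothetical sample instants: one in winter and one in summer (UTC).
SAMPLES = [
    datetime(1970, 1, 1, tzinfo=timezone.utc),
    datetime(1970, 7, 1, tzinfo=timezone.utc),
]

# Results vary by system: a machine without the legacy aliases is expected
# to report that "US/Eastern" does not resolve.
print(tz_key_resolves("US/Eastern"))
print(same_utc_offsets("US/Eastern", "America/New_York", SAMPLES))
```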
### Does this PR introduce _any_ user-facing change?
No. Test-only change.
### How was this patch tested?
Ran the suite locally under two environments:
```
# pandas 2.3.3 / Python 3.13
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
...
Tests passed in 31 seconds
# pandas 3.0.2 / Python 3.13
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
...
Tests passed in 30 seconds
```
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code
Closes #55974 from zhengruifeng/SPARK-fix-tz-uneastern.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
.../tests/coercion/test_pandas_udf_return_type.py | 45 +++++++++++++++++++++-
1 file changed, 44 insertions(+), 1 deletion(-)
diff --git a/python/pyspark/sql/tests/coercion/test_pandas_udf_return_type.py b/python/pyspark/sql/tests/coercion/test_pandas_udf_return_type.py
index 454fe726f95c..f1ba3cd84723 100644
--- a/python/pyspark/sql/tests/coercion/test_pandas_udf_return_type.py
+++ b/python/pyspark/sql/tests/coercion/test_pandas_udf_return_type.py
@@ -104,7 +104,7 @@ class PandasUDFReturnTypeTests(GoldenFileTestMixin, ReusedSQLTestCase):
np.arange(1, 3).astype("complex128"),
[np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], dtype=np.int32)],
pd.date_range("19700101", periods=2).values,
- pd.date_range("19700101", periods=2, tz="US/Eastern").values,
+ pd.date_range("19700101", periods=2, tz="America/New_York").values,
[pd.Timedelta("1 day"), pd.Timedelta("2 days")],
pd.Categorical(["A", "B"]),
pd.DataFrame({"_1": [1, 2]}),
@@ -160,6 +160,49 @@ class PandasUDFReturnTypeTests(GoldenFileTestMixin, ReusedSQLTestCase):
golden = None
if not generating:
golden = self.load_golden_csv(golden_csv)
+ # The golden file was generated under pandas 2; patch the loaded
+ # copy in memory so the same file works under pandas >= 3.0, where
+ # the defaults differ: datetime64 ndarrays use [us] instead of [ns],
+ # Categorical categories use str instead of object, and the same
+ # casts return microseconds instead of nanoseconds.
+ if LooseVersion(pd.__version__) >= LooseVersion("3.0.0"):
+ rename = {}
+ scale_cols = []
+ for value in self.test_data:
+ new_key = self.repr_value(value)
+ if isinstance(value, np.ndarray) and value.dtype.kind == "M":
+ old_key = self.repr_value(value.astype("datetime64[ns]"))
+ if old_key != new_key:
+ rename[old_key] = new_key
+ scale_cols.append(new_key)
+ elif isinstance(value, pd.Categorical) and value.categories.dtype != object:
+ old_key = self.repr_value(
+ pd.Categorical(
+ value.tolist(),
+ categories=pd.Index(value.categories.tolist(), dtype=object),
+ )
+ )
+ if old_key != new_key:
+ rename[old_key] = new_key
+ elif isinstance(value, list) and value and isinstance(value[0], pd.Timedelta):
+ scale_cols.append(new_key)
+
+ if rename:
+ golden.rename(columns=rename, inplace=True)
+
+ for col in scale_cols:
+ golden[col] = golden[col].str.replace(
+ r"\d{13,}",
+ lambda m: str(int(m.group()) // 1000),
+ regex=True,
+ )
+
+ # Pandas 3 succeeds at coercing string list -> Decimal where
+ # pandas 2 errored, so the corresponding cell flips from "X".
+ decimal_idx = self.repr_type(DecimalType(10, 0))
+ decimal_col = self.repr_value(["12", "34"])
+ if decimal_idx in golden.index and decimal_col in golden.columns:
+ golden.loc[decimal_idx, decimal_col] = "[Decimal('12'), Decimal('34')]"
def work(arg):
spark_type, value = arg