zhengruifeng opened a new pull request, #55974:
URL: https://github.com/apache/spark/pull/55974

   ### What changes were proposed in this pull request?
   
   This PR makes 
`pyspark.sql.tests.coercion.test_pandas_udf_return_type.PandasUDFReturnTypeTests`
 work under pandas >= 3.0 and on systems whose `tzdata` package no longer ships 
the legacy `US/*` aliases (e.g. Ubuntu 24.04 / noble).
   
   Two changes:
   
   1. **Switch the tz-aware fixture from `US/Eastern` to `America/New_York`.** 
The values returned by `pd.date_range(...).values` are identical for the two 
names (`US/Eastern` is a legacy link to `America/New_York`, so both use UTC-5 
standard time with the same DST rules), so the golden file does not need to be 
regenerated.
   
   2. **Remap the loaded golden DataFrame in memory for pandas >= 3.0.** The 
on-disk golden file was generated under pandas 2, where the default datetime 
ndarray resolution is `datetime64[ns]` and `pd.Categorical` keeps 
`object`-dtyped categories. Under pandas 3 those defaults are `datetime64[us]` 
and `str`-dtyped categories. The lookup keys built by `repr_value` therefore no 
longer match the golden column names. We rebuild the affected column names at 
load time (without touching the file on disk) so the same golden works for both 
pandas versions.
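
   The in-memory remap in item 2 can be sketched roughly as follows. This is an illustration only, not the code in this PR: `remap_golden_columns`, `PANDAS3_REWRITES`, and the label spellings are hypothetical, and the real keys produced by `repr_value` may differ in more than the dtype tag.

```python
# Hypothetical label fragments: the real golden column names come from the
# test's `repr_value` helper and may differ in more than the dtype tag.
PANDAS3_REWRITES = (
    ("datetime64[ns]", "datetime64[us]"),
    ("Categorical(object)", "Categorical(str)"),
)


def remap_golden_columns(columns, pandas_major):
    """Return column labels adjusted for the running pandas major version.

    Labels are left untouched for pandas < 3, so the same on-disk golden
    file can serve both versions.
    """
    if pandas_major < 3:
        return list(columns)
    remapped = []
    for name in columns:
        for old, new in PANDAS3_REWRITES:
            name = name.replace(old, new)
        remapped.append(name)
    return remapped


golden_columns = ["ts@datetime64[ns]", "cat@Categorical(object)", "1@int"]
print(remap_golden_columns(golden_columns, 2))
print(remap_golden_columns(golden_columns, 3))
```

   With a loaded DataFrame, the returned list could be assigned back to `golden.columns` after reading the file, which leaves the file on disk unchanged.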
   
   ### Why are the changes needed?
   
   Scheduled CI runs on the `python-312-pandas-3` image currently fail in this 
suite:
   
   - `pd.date_range("19700101", periods=2, tz="US/Eastern").values` raises 
`zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key 
US/Eastern'`. pandas 3 dropped `pytz` as a hard dependency and resolves tz 
names through stdlib `zoneinfo`, which on Ubuntu 24.04 cannot find `US/Eastern` 
because Ubuntu moved the legacy aliases out of `tzdata` into a separate 
`tzdata-legacy` package that the CI image does not install. Example failure: 
https://github.com/apache/spark/actions/runs/26002965955/job/76430490989
   - After the alias fix, `golden.loc[str_t, str_v]` raises `KeyError` because 
the column keys in the golden file are pandas-2-shaped (`datetime64[ns]`, 
`Categorical(..., object)`) but the lookup keys built at runtime are 
pandas-3-shaped (`datetime64[us]`, `Categorical(..., str)`).
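
   The first failure can be checked without pandas, since it comes from the stdlib `zoneinfo` lookup. The sketch below is not part of this PR; `tz_key_available` and `same_utc_offsets` are hypothetical helper names. It tests whether a key resolves on the current system and, when both keys resolve, compares their UTC offsets day by day over one year.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def tz_key_available(key):
    """Return True if stdlib zoneinfo can resolve `key` on this system."""
    try:
        ZoneInfo(key)
    except ZoneInfoNotFoundError:
        return False
    return True


def same_utc_offsets(key_a, key_b, year=2024):
    """Compare noon UTC offsets of two zones for 365 days of `year`.

    Returns None when either key cannot be resolved (for example when
    the legacy aliases are not installed).
    """
    if not (tz_key_available(key_a) and tz_key_available(key_b)):
        return None
    tz_a, tz_b = ZoneInfo(key_a), ZoneInfo(key_b)
    start = datetime(year, 1, 1, 12)
    for day in range(365):
        local = start + timedelta(days=day)
        if local.replace(tzinfo=tz_a).utcoffset() != local.replace(tzinfo=tz_b).utcoffset():
            return False
    return True


print("US/Eastern available:", tz_key_available("US/Eastern"))
print("offsets match:", same_utc_offsets("US/Eastern", "America/New_York"))
```

   On an image without the legacy aliases, the first line is expected to report that `US/Eastern` is unavailable and the comparison returns `None`.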
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Test-only change.
   
   ### How was this patch tested?
   
   Ran the suite locally with pandas 3.0.2 (Python 3.13):
   
   ```
   python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
   ```
   
   The previous `ZoneInfoNotFoundError` and `KeyError` failures are gone. Note: 
a few pandas-3 assertion mismatches remain. They are caused by the underlying 
nanosecond-to-microsecond resolution change propagating into cast results 
(e.g. the `bigint` row for datetime/timedelta columns), plus one cell where 
pandas 3 succeeds but pandas 2 raised an error (`['12', '34']@list` vs 
`decimal(10,0)`). Those are pre-existing pandas-3 incompatibilities and are 
out of scope for this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]