Yicong-Huang opened a new pull request, #53721: URL: https://github.com/apache/spark/pull/53721
### What changes were proposed in this pull request?

Add comprehensive tests for PyArrow's `pa.array` type coercion behavior when using an explicit `type` parameter. These tests monitor upstream PyArrow behavior to ensure PySpark's assumptions remain valid across versions.

The tests cover 12 categories with 51 test cases:

| Category | Tests |
|----------|-------|
| **Missing Values** | `None`, `NaN`, `pd.NaT`, `pd.NA` handling |
| **Empty Datasets** | Empty lists, pandas Series, NumPy arrays |
| **Invalid Values** | Overflow, precision loss, incompatible types |
| **Numeric Coercion** | int ↔ float, int narrowing/widening, unsigned |
| **Decimal Coercion** | int → decimal128/256 with various precision/scale |
| **String Coercion** | int → string (requires explicit `pc.cast()`) |
| **Boolean Coercion** | int → bool (requires explicit `pc.cast()`) |
| **Temporal Coercion** | timestamp, date, time, duration resolutions |
| **Timezone Coercion** | `datetime.datetime` and `pd.Timestamp` with timezones (UTC, Asia/Singapore, America/Los_Angeles) |
| **Nested Types** | list, struct, map element type coercion |
| **Input Types** | Python list/tuple/generator, pandas Series, NumPy arrays |
| **Spark Types** | All Spark numeric, decimal, temporal, string, binary, complex types |

Key coercion behaviors documented (an illustrative sketch follows at the end of this description):

| Coercion | Implicit | Notes |
|----------|----------|-------|
| int → float64/float32 | ✅ | Works directly |
| float → int64 | ✅ | Truncates fractional part |
| int → decimal128 | ✅ | Python list only |
| int → string | ❌ | Requires `pc.cast()` |
| int → bool | ❌ | Requires `pc.cast()` |
| float ↔ decimal | ❌ | Requires `pc.cast()` |
| numpy/pandas → decimal | ❌ | Requires sufficient precision for ArrowDtype |
| timezone-aware datetime | ✅ | Converts between timezones |

### Why are the changes needed?

This is part of [SPARK-54936](https://issues.apache.org/jira/browse/SPARK-54936), which tracks behavior changes in upstream dependencies. By testing PyArrow's type coercion behavior, we can detect breaking changes when upgrading PyArrow versions.

### Does this PR introduce _any_ user-facing change?

No. This PR only adds tests.

### How was this patch tested?

New unit tests were added; they can be run with:

```bash
python -m pytest python/pyspark/tests/upstream/pyarrow/test_pyarrow_type_coercion.py -v
```

### Was this patch authored or co-authored using generative AI tooling?

No.
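As a quick illustration of the behaviors summarized in the coercion table above, here is a minimal sketch (not code from this PR; exact exception types and messages can vary across PyArrow versions):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Implicit coercion: Python ints are accepted for an explicit float64 type.
floats = pa.array([1, 2, 3], type=pa.float64())
assert floats.type == pa.float64()

# Implicit coercion: Python ints are accepted for an explicit decimal128 type.
decimals = pa.array([1, 2, 3], type=pa.decimal128(10, 2))
assert decimals.type == pa.decimal128(10, 2)

# No implicit coercion: int -> string needs an explicit cast via pyarrow.compute.
try:
    pa.array([1, 2, 3], type=pa.string())
except (pa.ArrowInvalid, pa.ArrowTypeError):
    strings = pc.cast(pa.array([1, 2, 3]), pa.string())
    assert strings.type == pa.string()
```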
