Yicong-Huang opened a new pull request, #53721: URL: https://github.com/apache/spark/pull/53721
### What changes were proposed in this pull request?

Add comprehensive tests for PyArrow's `pa.array` type coercion behavior when using an explicit `type` parameter. These tests monitor upstream PyArrow behavior to ensure PySpark's assumptions remain valid across versions.

The tests cover 12 categories with 51 test cases:

| Category | Tests |
|----------|-------|
| **Missing Values** | `None`, `NaN`, `pd.NaT`, `pd.NA` handling |
| **Empty Datasets** | Empty lists, pandas Series, NumPy arrays |
| **Invalid Values** | Overflow, precision loss, incompatible types |
| **Numeric Coercion** | int ↔ float, int narrowing/widening, unsigned |
| **Decimal Coercion** | int → decimal128/256 with various precision/scale |
| **String Coercion** | int → string (requires explicit `pc.cast()`) |
| **Boolean Coercion** | int → bool (requires explicit `pc.cast()`) |
| **Temporal Coercion** | timestamp, date, time, duration resolutions |
| **Timezone Coercion** | `datetime.datetime` and `pd.Timestamp` with timezones (UTC, Asia/Singapore, America/Los_Angeles) |
| **Nested Types** | list, struct, map element type coercion |
| **Input Types** | Python list/tuple/generator, pandas Series, NumPy arrays |
| **Spark Types** | All Spark numeric, decimal, temporal, string, binary, complex types |

Key coercion behaviors documented (an illustrative sketch follows at the end of this description):

| Coercion | Implicit | Notes |
|----------|----------|-------|
| int → float64/float32 | ✅ | Works directly |
| float → int64 | ✅ | Truncates fractional part |
| int → decimal128 | ✅ | Python list only |
| int → string | ❌ | Requires `pc.cast()` |
| int → bool | ❌ | Requires `pc.cast()` |
| float ↔ decimal | ❌ | Requires `pc.cast()` |
| numpy/pandas → decimal | ❌ | Requires sufficient precision for ArrowDtype |
| timezone-aware datetime | ✅ | Converts between timezones |

### Why are the changes needed?

This is part of [SPARK-54936](https://issues.apache.org/jira/browse/SPARK-54936), which tracks behavior changes in upstream dependencies. By testing PyArrow's type coercion behavior, we can detect breaking changes when upgrading PyArrow versions.

### Does this PR introduce _any_ user-facing change?

No. This PR only adds tests.

### How was this patch tested?

New unit tests were added; they can be run with:

```bash
python -m pytest python/pyspark/tests/upstream/pyarrow/test_pyarrow_type_coercion.py -v
```

### Was this patch authored or co-authored using generative AI tooling?

No.
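As a quick illustration of the behaviors summarized in the coercion table above, here is a minimal sketch (not code from this PR; exact exception types and messages can vary across PyArrow versions):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Implicit coercion: Python ints are accepted for an explicit float64 type.
floats = pa.array([1, 2, 3], type=pa.float64())
assert floats.type == pa.float64()

# Implicit coercion: Python ints are accepted for an explicit decimal128 type.
decimals = pa.array([1, 2, 3], type=pa.decimal128(10, 2))
assert decimals.type == pa.decimal128(10, 2)

# No implicit coercion: int -> string needs an explicit cast via pyarrow.compute.
try:
    pa.array([1, 2, 3], type=pa.string())
except (pa.ArrowInvalid, pa.ArrowTypeError):
    strings = pc.cast(pa.array([1, 2, 3]), pa.string())
    assert strings.type == pa.string()
```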
