Yicong-Huang opened a new pull request, #56380: URL: https://github.com/apache/spark/pull/56380
### What changes were proposed in this pull request? This is a follow-up to #56157 (commit `2f4ed64204e`, `[SPARK-46776][PYTHON] Support pa.ChunkedArray columns in createDataFrame from pandas`), which introduced a regression in the minimum-dependencies CI job (`pyarrow==18.0.0`). In `python/pyspark/sql/pandas/conversion.py`, after `pa.Array.from_pandas(...)`, this PR adds a narrowly-scoped conditional cast back to the requested Arrow type. The cast only fires when all of the following hold: the requested `arrow_type` is not `None`, the installed pyarrow is older than 19.0.0, the produced array type does not already equal `arrow_type`, and the mismatch is exactly an offset-width difference between `string`/`large_string` or `binary`/`large_binary` (detected by the new `_is_string_or_binary_width_mismatch` helper). For any other version or type mismatch the result is returned untouched. ### Why are the changes needed? Since pandas 2.2.0, the `string[pyarrow]` extension dtype is backed by `large_string` (64-bit offsets). When such a series is converted via the `__arrow_array__` protocol, pyarrow < 19.0.0 ignores the requested `type` argument, so `pa.Array.from_pandas(series, type=pa.string())` returns a `large_string` (or `large_binary`) array even though `string` (or `binary`) was requested. The JVM then reads the 64-bit offset buffers against the 32-bit schema, silently corrupting the data. pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by casting the result to the requested type. This PR replicates that behavior for older pyarrow so the minimum-dependencies build (`pyarrow==18.0.0`) produces correct data instead of corrupt rows. ### Does this PR introduce _any_ user-facing change? Yes. On pyarrow < 19.0.0, `createDataFrame` from a pandas object containing `string[pyarrow]`/`large_string`-backed columns (and the `binary` equivalent) previously produced silently corrupted data; it now produces correct data. On pyarrow >= 19.0.0 there is no behavior change. ### How was this patch tested? Added unit tests in `python/pyspark/sql/tests/test_conversion.py`: - `test_string_or_binary_width_mismatch_helper` covers the `_is_string_or_binary_width_mismatch` helper across both offset-width directions for string and binary, plus negative cases (same type, `string`/`binary`, and unrelated integer-width mismatches). - `test_create_arrow_array_casts_large_string_on_old_pyarrow` and `test_create_arrow_array_casts_large_binary_on_old_pyarrow` simulate the pyarrow < 19 protocol behavior (a `large_*` result is returned for a requested `string`/`binary`) and assert the result is cast back and the data is preserved. - `test_create_arrow_array_no_cast_on_new_pyarrow` and `test_create_arrow_array_no_cast_on_unrelated_mismatch` assert the fix does not over-reach on pyarrow >= 19 or on unrelated (e.g. int32 vs int64) mismatches. The two casting tests fail without the fix (the corrupt `large_*` type leaks through); all tests pass with it. The full `test_conversion.py` suite passes (40 tests). ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
