Yicong-Huang opened a new pull request, #56380:
URL: https://github.com/apache/spark/pull/56380

   ### What changes were proposed in this pull request?
   
   This is a follow-up to #56157 (commit `2f4ed64204e`, `[SPARK-46776][PYTHON] 
Support pa.ChunkedArray columns in createDataFrame from pandas`), which 
introduced a regression in the minimum-dependencies CI job (`pyarrow==18.0.0`).
   
   In `python/pyspark/sql/pandas/conversion.py`, after 
`pa.Array.from_pandas(...)`, this PR adds a narrowly-scoped conditional cast 
back to the requested Arrow type. The cast only fires when all of the following 
hold: the requested `arrow_type` is not `None`, the installed pyarrow is older 
than 19.0.0, the produced array type does not already equal `arrow_type`, and 
the mismatch is exactly an offset-width difference between 
`string`/`large_string` or `binary`/`large_binary` (detected by the new 
`_is_string_or_binary_width_mismatch` helper). For any other version or type 
mismatch the result is returned untouched.
   
   ### Why are the changes needed?
   
   Since pandas 2.2.0, the `string[pyarrow]` extension dtype is backed by 
`large_string` (64-bit offsets). When such a series is converted via the 
`__arrow_array__` protocol, pyarrow < 19.0.0 ignores the requested `type` 
argument, so `pa.Array.from_pandas(series, type=pa.string())` returns a 
`large_string` (or `large_binary`) array even though `string` (or `binary`) was 
requested. The JVM then reads the 64-bit offset buffers against the 32-bit 
schema, silently corrupting the data.
   
   pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by 
casting the result to the requested type. This PR replicates that behavior for 
older pyarrow so the minimum-dependencies build (`pyarrow==18.0.0`) produces 
correct data instead of corrupt rows.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. On pyarrow < 19.0.0, `createDataFrame` from a pandas object containing 
`string[pyarrow]`/`large_string`-backed columns (and the `binary` equivalent) 
previously produced silently corrupted data; it now produces correct data. On 
pyarrow >= 19.0.0 there is no behavior change.
   
   ### How was this patch tested?
   
   Added unit tests in `python/pyspark/sql/tests/test_conversion.py`:
   
   - `test_string_or_binary_width_mismatch_helper` covers the 
`_is_string_or_binary_width_mismatch` helper across both offset-width 
directions for string and binary, plus negative cases (same type, 
`string`/`binary`, and unrelated integer-width mismatches).
   - `test_create_arrow_array_casts_large_string_on_old_pyarrow` and 
`test_create_arrow_array_casts_large_binary_on_old_pyarrow` simulate the 
pyarrow < 19 protocol behavior (a `large_*` result is returned for a requested 
`string`/`binary`) and assert the result is cast back and the data is preserved.
   - `test_create_arrow_array_no_cast_on_new_pyarrow` and 
`test_create_arrow_array_no_cast_on_unrelated_mismatch` assert the fix does not 
over-reach on pyarrow >= 19 or on unrelated (e.g. int32 vs int64) mismatches.
   
   The two casting tests fail without the fix (the corrupt `large_*` type leaks 
through); all tests pass with it. The full `test_conversion.py` suite passes 
(40 tests).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to