Yicong-Huang opened a new pull request, #54362:
URL: https://github.com/apache/spark/pull/54362
### What changes were proposed in this pull request?
This PR adds Arrow schema type validation for the `pa.RecordBatch` code path
in Python data source reads. Previously, only column count and column names
were validated. If a data source returned a `RecordBatch` with correct column
names but mismatched data types (e.g., returning `int32` when the schema
declares `string`), the mismatch was not caught in Python and would propagate
to the JVM, resulting in a cryptic `IllegalArgumentException` from Arrow's
`VectorLoader` ("not all nodes, buffers and variadicBufferCounts were
consumed") or an `UnsupportedOperationException` from `ArrowColumnVector`.
The fix adds a `pa_schema.equals(first_element.schema)` check after the
existing column name validation in `records_to_arrow_batches()`, raising a
clear `DATA_SOURCE_RETURN_SCHEMA_MISMATCH` error with the expected and actual
Arrow schemas.
### Why are the changes needed?
When a Python data source returns a `pa.RecordBatch` with data types that
don't match the declared schema, the resulting JVM-side errors are confusing
and do not indicate the root cause. For example:
- `IllegalArgumentException: not all nodes, buffers and variadicBufferCounts
were consumed` from `VectorLoader.load()`
- `UnsupportedOperationException: Cannot call the method "getUTF8String" of
ArrowColumnVector$ArrowVectorAccessor`
These errors give no indication that the issue is a schema type mismatch in
the Python data source's `read()` method. By validating the Arrow schema types
on the Python side before sending data to the JVM, we provide a clear,
actionable error message.
### Does this PR introduce _any_ user-facing change?
Yes. Previously, returning a `pa.RecordBatch` with mismatched types from a
Python data source would result in cryptic JVM errors. Now it raises a clear
`DATA_SOURCE_RETURN_SCHEMA_MISMATCH` error showing the expected and actual
Arrow schemas.
### How was this patch tested?
Added a test case in
`test_python_datasource.py::test_arrow_batch_data_source` that creates a
`MismatchedTypeDataSource` declaring schema `"key string, value string"` but
returning a `RecordBatch` with `(int32, string)` types, and verifies the
`DATA_SOURCE_RETURN_SCHEMA_MISMATCH` error is raised.
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]