Yicong-Huang opened a new pull request, #54362:
URL: https://github.com/apache/spark/pull/54362
### What changes were proposed in this pull request?
This PR adds Arrow schema type validation for the `pa.RecordBatch` code path
in Python data source reads. Previously, only column count and column names
were validated. If a data source returned a `RecordBatch` with correct column
names but mismatched data types (e.g., returning `int32` when the schema
declares `string`), the mismatch was not caught in Python and would propagate
to the JVM, resulting in a cryptic `IllegalArgumentException` from Arrow's
`VectorLoader` ("not all nodes, buffers and variadicBufferCounts were
consumed") or an `UnsupportedOperationException` from `ArrowColumnVector`.
The fix adds a `pa_schema.equals(first_element.schema)` check after the
existing column name validation in `records_to_arrow_batches()`, raising a
clear `DATA_SOURCE_RETURN_SCHEMA_MISMATCH` error with the expected and actual
Arrow schemas.
### Why are the changes needed?
When a Python data source returns a `pa.RecordBatch` with data types that
don't match the declared schema, the resulting JVM-side errors are confusing
and do not indicate the root cause. For example:
- `IllegalArgumentException: not all nodes, buffers and variadicBufferCounts
were consumed` from `VectorLoader.load()`
- `UnsupportedOperationException: Cannot call the method "getUTF8String" of
ArrowColumnVector$ArrowVectorAccessor`
These errors give no indication that the issue is a schema type mismatch in
the Python data source's `read()` method. By validating the Arrow schema types
on the Python side before sending data to the JVM, we provide a clear,
actionable error message.
### Does this PR introduce _any_ user-facing change?
Yes. Previously, returning a `pa.RecordBatch` with mismatched types from a
Python data source would result in cryptic JVM errors. Now it raises a clear
`DATA_SOURCE_RETURN_SCHEMA_MISMATCH` error showing the expected and actual
Arrow schemas.
### How was this patch tested?
Added a test case in
`test_python_datasource.py::test_arrow_batch_data_source` that creates a
`MismatchedTypeDataSource` declaring schema `"key string, value string"` but
returning a `RecordBatch` with `(int32, string)` types, and verifies the
`DATA_SOURCE_RETURN_SCHEMA_MISMATCH` error is raised.
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]