Re: [I] Inconsistencies between `RecordBatch` and `DataFrame` schemas cause `to_arrow_table` to fail [datafusion-python]

via GitHub Thu, 04 Dec 2025 00:40:51 -0800


IshaGudewar commented on issue #1314:
URL: 
https://github.com/apache/datafusion-python/issues/1314#issuecomment-3610901767


   Thanks for opening this issue and the fix!
   I reviewed the problem and the proposed solution in PR #1315, and I can 
confirm that the root cause is the schema mismatch between:
   
   - the DataFrame schema (which may mark fields as NOT NULL), and
   - the RecordBatch schema (which often marks aggregated columns as nullable).
   
   PyArrow requires every RecordBatch schema to match the table schema 
exactly, including field nullability, so this mismatch causes:
   
   ArrowInvalid: Schema at index 0 was different
   
   
   
   To help strengthen the PR, I will add:
   
   1. A regression test for the zero-record-batch case, verifying that when no 
batches are returned, to_arrow_table() still uses the DataFrame schema.
   
   2. A test for an aggregation that returns NULL, such as max(a) on an empty 
table, confirming the output is [None] and schema nullability is handled 
correctly.
   
   I will contribute these tests to PR #1315 shortly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

