IshaGudewar commented on issue #1314: URL: https://github.com/apache/datafusion-python/issues/1314#issuecomment-3610901767
Thanks for opening this issue and for the fix!

I reviewed the problem and the proposed solution in PR #1315, and I can confirm that the root cause is the schema mismatch between:

- the DataFrame schema (which may mark fields as NOT NULL), and
- the RecordBatch schema (which often marks aggregated columns as nullable).

PyArrow requires all RecordBatch schemas to match the target schema exactly, including field nullability, so this mismatch causes:

    ArrowInvalid: Schema at index 0 was different

To help strengthen the PR, I will add:

1. A regression test for the zero-record-batch case, verifying that when no batches are returned, to_arrow_table() still uses the DataFrame schema.
2. A test for an aggregation that returns NULL, such as max(a) on an empty table, confirming that the output is [None] and that schema nullability is handled correctly.

I will contribute these tests to PR #1315 shortly.
