Re: [PR] feat: Add RecordBatchOptions::skip_schema_check option [arrow-rs]

via GitHub Mon, 09 Dec 2024 07:52:44 -0800


tustvold commented on PR #6855:
URL: https://github.com/apache/arrow-rs/pull/6855#issuecomment-2528255451


   > This is DataFusion limitation rather than an Arrow limitation, but 
DataFusion uses Arrow's RecordBatch.
   > It would be nice eventually if DataFusion would just require the logical 
schema to be the same for all batches but allow differences in the physical 
type.
   
   I think this is the key issue: the schema of the RecordBatch is the physical 
type. Arrow has no notion of a logical type, nor realistically can it, because 
what that would look like is so use-case specific. Are Int32 and Int64 the same 
logical type? What about differing decimal precisions?
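   
   To make the ambiguity concrete, here is a minimal, self-contained sketch. 
The `PhysicalType` enum and both policy functions are hypothetical and are not 
part of arrow-rs; they only model two equally defensible answers to the same 
question.

```rust
/// Hypothetical stand-in for a physical data type (NOT arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Int32,
    Int64,
    Decimal { precision: u8, scale: i8 },
}

/// Policy A (assumed): types are "logically equal" only if physically identical.
fn strict_eq(a: PhysicalType, b: PhysicalType) -> bool {
    a == b
}

/// Policy B (assumed): integer widths are interchangeable, and decimals are
/// considered equal whenever their scales agree, regardless of precision.
fn lenient_eq(a: PhysicalType, b: PhysicalType) -> bool {
    use PhysicalType::*;
    match (a, b) {
        (Int32 | Int64, Int32 | Int64) => true,
        (Decimal { scale: s1, .. }, Decimal { scale: s2, .. }) => s1 == s2,
        _ => false,
    }
}

fn main() {
    let d1 = PhysicalType::Decimal { precision: 10, scale: 2 };
    let d2 = PhysicalType::Decimal { precision: 38, scale: 2 };

    // The two policies disagree on exactly the cases raised above.
    assert!(!strict_eq(PhysicalType::Int32, PhysicalType::Int64));
    assert!(lenient_eq(PhysicalType::Int32, PhysicalType::Int64));
    assert!(!strict_eq(d1, d2));
    assert!(lenient_eq(d1, d2));
}
```

   Neither policy is more correct than the other, which is why a single 
library-wide definition is hard to justify.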
   
   Ultimately, as the schema cannot vary within a single RecordBatch, the onus 
is on whichever component imposes the inter-RecordBatch constraint to judge 
whether it accepts heterogeneous inputs. This PR effectively breaks a fairly 
fundamental invariant of RecordBatch in order to bypass checks in other 
components. Those checks are either necessary because the component relies on 
them, or unnecessary and therefore could/should just be removed.
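   
   As a rough sketch of that split of responsibilities: the batch enforces 
only its own internal consistency, and the consumer decides how strict to be 
across batches. All names here (`Ty`, `Batch`, `consume`, `identical`, 
`same_arity`) are hypothetical and std-only; this is a model of the idea, not 
the arrow-rs or DataFusion API.

```rust
/// Hypothetical stand-in for a physical type (NOT arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int32,
    Int64,
}

/// Hypothetical batch; each column is reduced to its type for brevity.
#[derive(Debug)]
struct Batch {
    schema: Vec<Ty>,
}

impl Batch {
    /// Intra-batch invariant: the columns must match the declared schema exactly.
    fn try_new(schema: Vec<Ty>, column_types: Vec<Ty>) -> Result<Self, String> {
        if schema != column_types {
            return Err(format!(
                "schema {:?} does not match columns {:?}",
                schema, column_types
            ));
        }
        Ok(Batch { schema })
    }
}

/// Inter-batch constraint: owned by the consumer, which picks its own policy.
/// Returns the number of accepted batches, or an error on the first mismatch.
fn consume(batches: &[Batch], compatible: fn(&[Ty], &[Ty]) -> bool) -> Result<usize, String> {
    let first = match batches.first() {
        Some(b) => b,
        None => return Ok(0),
    };
    for b in batches {
        if !compatible(&first.schema, &b.schema) {
            return Err(format!("incompatible: {:?} vs {:?}", first.schema, b.schema));
        }
    }
    Ok(batches.len())
}

/// Strict policy (assumed): every batch must carry an identical schema.
fn identical(a: &[Ty], b: &[Ty]) -> bool {
    a == b
}

/// Relaxed policy (assumed): only the number of columns must agree.
fn same_arity(a: &[Ty], b: &[Ty]) -> bool {
    a.len() == b.len()
}

fn main() {
    // A batch that disagrees with its own schema is rejected outright.
    assert!(Batch::try_new(vec![Ty::Int32], vec![Ty::Int64]).is_err());

    let batches = vec![
        Batch::try_new(vec![Ty::Int32], vec![Ty::Int32]).unwrap(),
        Batch::try_new(vec![Ty::Int64], vec![Ty::Int64]).unwrap(),
    ];
    // Each batch is valid on its own; only the consumer's policy differs.
    assert!(consume(&batches, identical).is_err());
    assert_eq!(consume(&batches, same_arity), Ok(2));
}
```

   In this model, relaxing the inter-batch rule means changing the consumer's 
policy; the batch-level check never needs to be skipped.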
   
   Or to phrase it differently, I can't see what correct usage there could be 
of this API that isn't just working around an over-zealous constraint in some 
unrelated system.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
