andygrove commented on PR #6855:
URL: https://github.com/apache/arrow-rs/pull/6855#issuecomment-2528228872

   > I'm really unsure about this as it will break things in unexpected ways, 
lots of codepaths assume the schema is correct, what is the motivation for 
having RecordBatch with the same but incorrect schema? Why does the schema need 
to be the same?
   
   The motivation is that when reading Parquet files for a single table, the 
physical Arrow type of a column is not the same for every batch: in some 
batches the column comes back dictionary-encoded and in others it does not.
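   To illustrate (a minimal arrow-rs sketch with a hypothetical column `c`, 
not code from this PR): the same logical data can arrive in one batch with 
the column dictionary-encoded and in another as plain `Utf8`, and the two 
`RecordBatch` schemas then do not compare equal.
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{DictionaryArray, StringArray};
   use arrow::datatypes::{DataType, Field, Int32Type, Schema};
   use arrow::error::ArrowError;
   use arrow::record_batch::RecordBatch;
   
   fn main() -> Result<(), ArrowError> {
       // One batch arrives with the column dictionary-encoded ...
       let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
       let dict_schema = Arc::new(Schema::new(vec![Field::new(
           "c",
           DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
           true,
       )]));
       let batch1 = RecordBatch::try_new(dict_schema, vec![Arc::new(dict)])?;
   
       // ... and another arrives with the same values as plain Utf8.
       let plain = StringArray::from(vec!["a", "b", "a"]);
       let plain_schema = Arc::new(Schema::new(vec![Field::new("c", DataType::Utf8, true)]));
       let batch2 = RecordBatch::try_new(plain_schema, vec![Arc::new(plain)])?;
   
       // Same logical data, but the physical schemas do not compare equal.
       assert_ne!(batch1.schema(), batch2.schema());
       Ok(())
   }
   ```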
   
   DataFusion requires each operator to have a single fixed schema for all of 
its batches, so we currently have to coerce every batch into that schema. This 
is a DataFusion limitation rather than an Arrow limitation, but DataFusion uses 
Arrow's RecordBatch.
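   What we do today amounts to something like the following (a sketch only; 
`coerce_batch` is a hypothetical helper, not the actual DataFusion code): cast 
every column to the type declared in the operator's schema before passing the 
batch along, which for dictionary columns means materializing the values.
   
   ```rust
   use arrow::array::ArrayRef;
   use arrow::compute::cast;
   use arrow::datatypes::SchemaRef;
   use arrow::error::ArrowError;
   use arrow::record_batch::RecordBatch;
   
   /// Hypothetical helper: cast every column of `batch` to the type declared in
   /// `target_schema` so that all batches share one physical schema. Assumes the
   /// batch and the target schema have the same columns in the same order.
   fn coerce_batch(batch: &RecordBatch, target_schema: SchemaRef) -> Result<RecordBatch, ArrowError> {
       let columns = batch
           .columns()
           .iter()
           .zip(target_schema.fields().iter())
           .map(|(col, field)| cast(col, field.data_type()))
           .collect::<Result<Vec<ArrayRef>, ArrowError>>()?;
       RecordBatch::try_new(target_schema, columns)
   }
   ```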
   
   Eventually, it would be nice if DataFusion only required the logical schema 
to be the same for all batches and allowed the physical type to differ.
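   Concretely, that could mean comparing schemas after unwrapping dictionary 
types, along these lines (again a hypothetical sketch, not a proposal for a 
specific API):
   
   ```rust
   use arrow::datatypes::{DataType, Schema};
   
   /// Hypothetical notion of a column's "logical" type: a dictionary-encoded
   /// column is logically the same as its value type.
   fn logical_type(dt: &DataType) -> &DataType {
       match dt {
           DataType::Dictionary(_, value_type) => value_type.as_ref(),
           other => other,
       }
   }
   
   /// Hypothetical check: two schemas are logically equal if the field names
   /// match and the types match after unwrapping dictionary encoding.
   fn logically_equal(a: &Schema, b: &Schema) -> bool {
       a.fields().len() == b.fields().len()
           && a.fields()
               .iter()
               .zip(b.fields().iter())
               .all(|(fa, fb)| {
                   fa.name() == fb.name()
                       && logical_type(fa.data_type()) == logical_type(fb.data_type())
               })
   }
   ```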

