andygrove commented on PR #6855: URL: https://github.com/apache/arrow-rs/pull/6855#issuecomment-2528228872
> I'm really unsure about this as it will break things in unexpected ways, lots of codepaths assume the schema is correct, what is the motivation for having RecordBatch with the same but incorrect schema? Why does the schema need to be the same?

The motivation is that when reading Parquet files for one table, the physical type is not the same for all batches, because a column is sometimes dictionary-encoded and sometimes not. DataFusion requires that each operator have a single fixed schema for all batches, so we currently have to coerce every batch into that schema. This is a DataFusion limitation rather than an Arrow limitation, but DataFusion uses Arrow's RecordBatch. It would be nice if DataFusion eventually only required the logical schema to be the same for all batches while allowing differences in the physical type.
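
For illustration, here is a minimal sketch (the column name `c` and the concrete types are hypothetical, not taken from the PR) of the kind of coercion described above: a batch whose column was decoded as a dictionary array is cast back to the plain type so that it matches the one fixed schema the engine expects.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray, Int32Array, RecordBatch, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The fixed "logical" schema the engine expects for every batch:
    // a plain Utf8 column.
    let target_schema = Arc::new(Schema::new(vec![Field::new("c", DataType::Utf8, false)]));

    // One batch might decode the column as a dictionary array...
    let keys = Int32Array::from(vec![0, 1, 0]);
    let values = StringArray::from(vec!["a", "b"]);
    let dict: ArrayRef = Arc::new(DictionaryArray::try_new(keys, Arc::new(values))?);

    // ...so its physical type differs from the target schema, and the column
    // has to be cast before the batch can share a schema with batches that
    // were decoded as plain Utf8.
    let coerced: ArrayRef = cast(&dict, &DataType::Utf8)?;
    let batch = RecordBatch::try_new(target_schema, vec![coerced])?;

    assert_eq!(batch.schema().field(0).data_type(), &DataType::Utf8);
    Ok(())
}
```

The cost of this coercion for dictionary-heavy data is part of why it would be preferable to only require the logical schema to match.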
