etseidl commented on issue #8643: URL: https://github.com/apache/arrow-rs/issues/8643#issuecomment-3481980949
One thing I've noticed now that I'm far along on #8763 is that it will take a bit of retooling of the arrow reader API to reduce unnecessary metadata decoding. Take the example of wanting to create an arrow reader with a column projection.

```rust
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
let mask = ProjectionMask::leaves(builder.parquet_schema(), [2]);
let reader = builder.with_projection(mask).build()?;
```

`try_new` calls `try_new_with_options` with default options, which reads the file metadata in its entirety before calling `Self::new_with_metadata`.

https://github.com/apache/arrow-rs/blob/b8a192696b8216878f247b6a2d8dd63a09558063/parquet/src/arrow/arrow_reader/mod.rs#L744-L752

Only after we already have the parquet metadata in hand do we actually set the projection mask on the builder. To get around this, we'll need the projection mask set in the `ArrowReaderOptions`, which really means in its contained `MetadataOptions`.

Not a show stopper, but we'll need to be careful to avoid breaking changes. I'd imagine many of the options on the `ArrowReaderBuilder` (projection, row_groups, etc.) would need to move to the metadata options, but be left in the current API with deprecation notices and a warning that setting such options on both the `ArrowReaderBuilder` and `MetadataOptions` may lead to UB.
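The ordering problem described in this comment can be illustrated with a toy model that does not use the parquet crate at all. Everything in the sketch is hypothetical: `ColumnMeta`, the `MetadataOptions` struct and its `with_projection` method, and `decode_metadata` are stand-ins invented for illustration, not the actual arrow-rs types or the API proposed in #8763. The only point is that a projection known before decoding lets the decoder skip columns, whereas a projection applied afterwards cannot undo work already done.

```rust
/// Hypothetical stand-in for the per-column metadata stored in a file footer.
#[derive(Debug, Clone, PartialEq)]
struct ColumnMeta {
    index: usize,
    name: String,
}

/// Hypothetical options that are consulted *while* metadata is decoded.
/// (Assumption: this is not the real `MetadataOptions` from arrow-rs.)
#[derive(Default)]
struct MetadataOptions {
    /// Leaf column indices to decode; `None` means decode everything.
    projection: Option<Vec<usize>>,
}

impl MetadataOptions {
    /// Record the projection before any decoding happens.
    fn with_projection(mut self, leaves: impl IntoIterator<Item = usize>) -> Self {
        self.projection = Some(leaves.into_iter().collect());
        self
    }

    /// Whether the column at `idx` should be decoded under these options.
    fn wants(&self, idx: usize) -> bool {
        self.projection.as_ref().map_or(true, |p| p.contains(&idx))
    }
}

/// Toy decoder: only materializes metadata for columns the options ask for.
/// `raw` plays the role of the encoded footer, one entry per leaf column.
fn decode_metadata(raw: &[&str], opts: &MetadataOptions) -> Vec<ColumnMeta> {
    raw.iter()
        .enumerate()
        .filter(|(i, _)| opts.wants(*i))
        .map(|(i, name)| ColumnMeta {
            index: i,
            name: name.to_string(),
        })
        .collect()
}
```

With default options, `decode_metadata` materializes every column, which mirrors what happens today when the mask is only set on the builder after `try_new`. With `MetadataOptions::default().with_projection([2])` it materializes a single entry, which is the saving that moving the projection into the options is meant to enable.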
