etseidl commented on issue #8643:
URL: https://github.com/apache/arrow-rs/issues/8643#issuecomment-3481980949

   One thing I've noticed now that I'm far along on #8763 is that it will take 
a bit of retooling the arrow reader API to reduce unnecessary metadata 
decoding. Take the example of wanting to create an arrow reader with a column 
projection.
   
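   ```rust
   use parquet::arrow::{arrow_reader::ParquetRecordBatchReaderBuilder, ProjectionMask};

   // `try_new` decodes the full file metadata before any projection is known.
   let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
   // The mask needs the Parquet schema, which comes from the constructed builder.
   let mask = ProjectionMask::leaves(builder.parquet_schema(), [2]);
   let reader = builder.with_projection(mask).build()?;
   ```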
   
   The `try_new` calls `try_new_with_options` using default options, which will 
read the file metadata in its entirety before calling 
`Self::new_with_metadata`. 
https://github.com/apache/arrow-rs/blob/b8a192696b8216878f247b6a2d8dd63a09558063/parquet/src/arrow/arrow_reader/mod.rs#L744-L752
   
   Only after we already have the parquet metadata in hand will we actually set 
the projection mask on the builder. To get around this, we'll need the 
projection mask set in the `ArrowReaderOptions`, which really means in its 
contained `MetadataOptions`.
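
To make the ordering problem concrete, here is a minimal toy model. It does not use the parquet crate at all: `ToyFile`, `ToyOptions`, `ToyMetadata`, and the functions below are hypothetical names invented for illustration. It only assumes that a decoder which knows the projection up front can skip the columns that will never be read.

```rust
/// Toy stand-in for a Parquet file: only the number of leaf columns matters.
pub struct ToyFile {
    pub num_columns: usize,
}

/// Hypothetical options consulted *before* metadata is decoded.
#[derive(Default)]
pub struct ToyOptions {
    /// Leaf column indices to keep; `None` means decode everything.
    pub projection: Option<Vec<usize>>,
}

/// Toy decoded metadata: records which columns were actually decoded.
pub struct ToyMetadata {
    pub decoded_columns: Vec<usize>,
}

/// Decode only the columns the options ask for (all of them by default).
pub fn decode_metadata(file: &ToyFile, options: &ToyOptions) -> ToyMetadata {
    let decoded_columns: Vec<usize> = match &options.projection {
        Some(wanted) => wanted
            .iter()
            .copied()
            .filter(|&i| i < file.num_columns)
            .collect(),
        None => (0..file.num_columns).collect(),
    };
    ToyMetadata { decoded_columns }
}

/// Current flow: decode with default options, apply the projection afterwards.
/// Returns (columns decoded, columns actually kept).
pub fn project_after_decode(file: &ToyFile, projection: &[usize]) -> (usize, usize) {
    let metadata = decode_metadata(file, &ToyOptions::default());
    let kept = metadata
        .decoded_columns
        .iter()
        .filter(|&&i| projection.contains(&i))
        .count();
    (metadata.decoded_columns.len(), kept)
}

/// Proposed flow: the projection travels in the options, so the decoder
/// never touches columns outside it.
pub fn project_before_decode(file: &ToyFile, projection: &[usize]) -> (usize, usize) {
    let options = ToyOptions {
        projection: Some(projection.to_vec()),
    };
    let metadata = decode_metadata(file, &options);
    let decoded = metadata.decoded_columns.len();
    (decoded, decoded)
}
```

For a ten-column toy file and a projection of `[2]`, the first function decodes all ten columns to keep one, while the second decodes only the one it needs.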
   
   Not a showstopper, but to avoid breaking changes we'll need to be careful. 
I'd imagine many of the options on the `ArrowReaderBuilder` (projection, 
row_groups, etc.) would need to move to the metadata options, but be left in the 
current API with deprecation notices and a warning that setting such options 
with both the `ArrowReaderBuilder` and `MetadataOptions` may lead to undefined 
behavior.
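
One way such a transition could look is sketched below, again as a toy model and not the actual arrow-rs API. `ToyMetadataOptions`, `ToyReaderBuilder`, and `effective_projection` are hypothetical names. The sketch also makes one design choice that goes beyond the comment above: conflicting settings produce an error instead of being left undefined.

```rust
/// Hypothetical options struct that would own the projection in a new API.
#[derive(Default, Clone, Debug, PartialEq)]
pub struct ToyMetadataOptions {
    pub projection: Option<Vec<usize>>,
}

/// Hypothetical builder that keeps its old setter for compatibility.
#[derive(Default)]
pub struct ToyReaderBuilder {
    options: ToyMetadataOptions,
    legacy_projection: Option<Vec<usize>>,
}

impl ToyReaderBuilder {
    pub fn new(options: ToyMetadataOptions) -> Self {
        Self {
            options,
            legacy_projection: None,
        }
    }

    /// Old entry point, kept so existing callers still compile.
    #[deprecated(note = "set the projection on ToyMetadataOptions instead")]
    pub fn with_projection(mut self, columns: Vec<usize>) -> Self {
        self.legacy_projection = Some(columns);
        self
    }

    /// Resolve the projection to use, rejecting settings that disagree.
    pub fn effective_projection(&self) -> Result<Option<Vec<usize>>, String> {
        match (&self.options.projection, &self.legacy_projection) {
            (Some(a), Some(b)) if a != b => Err(format!(
                "projection set twice with different values: {:?} vs {:?}",
                a, b
            )),
            (Some(a), _) => Ok(Some(a.clone())),
            (None, b) => Ok(b.clone()),
        }
    }
}
```

Existing callers of the deprecated setter would see a compiler warning but keep working, and a caller that sets both to different values would get an explicit error.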
   

