etseidl opened a new issue, #8643:
URL: https://github.com/apache/arrow-rs/issues/8643

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   - Part of #5853
   
   One of the goals of the Thrift remodel project (#5854) was to enable 
selective decoding of parts of the Parquet metadata. The parsers needed for 
this are now in place, but what is still lacking is a way to communicate 
which parts of the metadata are required.
   
   **Describe the solution you'd like**
   Some mechanism to communicate to the metadata parsers what is needed. 
Options can include such things as:
   - Skip some statistics fields in `ColumnMetaData` (`Statistics`, 
`PageEncodingStats`, `SizeStatistics`, etc).
   - Parse page encoding statistics into some other form (boolean, bitmask) to 
support dictionary-based pushdown.
   - Column projections (i.e. skip decoding metadata for columns that will not 
be read).
   - Row group selection (only parse metadata for requested set of row groups).
   - Only return schema.
   - Skip schema and use a provided schema (perhaps from an earlier decode).
   - Perhaps move encryption parameters here as well.
   - Others I haven't yet thought of.
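
   As a rough illustration of the options listed above, a single options 
type might look something like the sketch below. Everything in it is 
hypothetical: `MetadataOptions`, its builder methods, and the helper functions 
are names invented for this example and are not part of the parquet crate. The 
only values taken from the Parquet format are the Thrift `Encoding` codes for 
`PLAIN_DICTIONARY` (2) and `RLE_DICTIONARY` (8), used to show how page encoding 
statistics could be folded into a bitmask.

```rust
use std::collections::HashSet;

/// Hypothetical options describing which parts of the footer to decode.
/// The default is assumed to mean "decode everything".
#[derive(Debug, Clone, Default)]
pub struct MetadataOptions {
    skip_statistics: bool,
    schema_only: bool,
    /// Leaf column indices to decode; `None` means all columns.
    column_projection: Option<HashSet<usize>>,
    /// Row group indices to decode; `None` means all row groups.
    row_groups: Option<HashSet<usize>>,
}

impl MetadataOptions {
    pub fn new() -> Self {
        Self::default()
    }

    /// Ask the parser to skip the statistics fields of each column chunk.
    pub fn with_skip_statistics(mut self, skip: bool) -> Self {
        self.skip_statistics = skip;
        self
    }

    /// Ask the parser to return only the schema.
    pub fn with_schema_only(mut self, schema_only: bool) -> Self {
        self.schema_only = schema_only;
        self
    }

    /// Restrict decoding to the given leaf column indices.
    pub fn with_column_projection(mut self, cols: impl IntoIterator<Item = usize>) -> Self {
        self.column_projection = Some(cols.into_iter().collect());
        self
    }

    /// Restrict decoding to the given row group indices.
    pub fn with_row_groups(mut self, groups: impl IntoIterator<Item = usize>) -> Self {
        self.row_groups = Some(groups.into_iter().collect());
        self
    }

    pub fn skip_statistics(&self) -> bool {
        self.skip_statistics
    }

    /// A parser would consult this before decoding a column chunk.
    pub fn should_decode_column(&self, idx: usize) -> bool {
        !self.schema_only
            && self.column_projection.as_ref().map_or(true, |c| c.contains(&idx))
    }

    /// A parser would consult this before decoding a row group.
    pub fn should_decode_row_group(&self, idx: usize) -> bool {
        !self.schema_only && self.row_groups.as_ref().map_or(true, |g| g.contains(&idx))
    }
}

// Thrift `Encoding` codes from the Parquet format.
const PLAIN_DICTIONARY: u8 = 2;
const RLE_DICTIONARY: u8 = 8;
const DICT_BITS: u32 = (1u32 << PLAIN_DICTIONARY) | (1u32 << RLE_DICTIONARY);

/// Fold the encodings used by a column chunk's data pages into a bitmask,
/// one bit per `Encoding` code. Codes of 32 or more are ignored here.
pub fn data_page_encoding_mask(encodings: impl IntoIterator<Item = u8>) -> u32 {
    encodings
        .into_iter()
        .filter(|e| *e < 32)
        .fold(0u32, |mask, e| mask | (1u32 << e))
}

/// True when at least one data page was seen and all of them used a
/// dictionary encoding, which is what dictionary-based pushdown needs.
pub fn all_pages_dictionary_encoded(mask: u32) -> bool {
    mask != 0 && (mask & !DICT_BITS) == 0
}
```

   A builder of this shape keeps the default behavior unchanged for callers 
that set nothing, while the bitmask replaces a per-page list with a single 
integer per column chunk.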
   
   
   **Describe alternatives you've considered**
   These options could be added to the current properties objects, but there 
doesn't seem to be a single place for all of them. For instance, 
`SerializedFileReader` takes a `ReadOptions`, which contains a 
`ReaderProperties` that is subsequently used by the 
`SerializedRowGroupReader` and its children. On the arrow side we instead use 
an `ArrowReaderOptions`. The `ParquetMetaDataReader` and 
`ParquetMetaDataPushDecoder` each manage their own set of options. It would be 
nice to have a single place to set metadata parsing options and then pass that 
to the respective decoders.
   
   **Additional context**
   


