alamb opened a new issue, #9929: URL: https://github.com/apache/arrow-datafusion/issues/9929
### Is your feature request related to a problem or challenge?

We are building and testing a specialized index for data stored in Parquet that can tell us which row offsets are needed from the Parquet file, based on additional information.

Currently the parquet-rs Parquet reader allows specifying this type of information via [`ArrowReaderBuilder::with_row_selection`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection).

However, the DataFusion [`ParquetExec`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html) has no way to pass this information down. It does build its own

### Describe the solution you'd like

What I would like is a way to provide something like a [`RowSelection`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html) for each row group.

### Describe alternatives you've considered

Here is one possible API:

```rust
let parquet_selection = ParquetSelection::new()
    // * rows 100-250 from row group 1
    .select(1, RowSelection::from(vec![
        RowSelector::skip(100),
        RowSelector::select(150),
    ]))
    // * rows 50-100 and 200-300 in row group 2
    .select(2, RowSelection::from(vec![
        RowSelector::skip(50),
        RowSelector::select(50),
        RowSelector::skip(100),
        RowSelector::select(100),
    ]));

let parquet_exec = ParquetExec::new(...)
    .with_selection(parquet_selection);
```

### Additional context

_No response_
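To make the proposed per-row-group selection concrete, here is a minimal, self-contained sketch of how such a container might behave. Everything in it is an assumption for illustration: `ParquetSelection` and its `select` and `selected_ranges` methods are hypothetical and do not exist in DataFusion, and the local `RowSelector` is a simplified stand-in for the parquet crate's type, used so the example compiles without external dependencies.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for parquet's `RowSelector`: a run of rows to skip or read.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowSelector {
    pub row_count: usize,
    pub skip: bool,
}

impl RowSelector {
    pub fn skip(row_count: usize) -> Self {
        Self { row_count, skip: true }
    }

    pub fn select(row_count: usize) -> Self {
        Self { row_count, skip: false }
    }
}

/// Hypothetical container: one list of selectors per row group index.
#[derive(Debug, Default)]
pub struct ParquetSelection {
    per_row_group: BTreeMap<usize, Vec<RowSelector>>,
}

impl ParquetSelection {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style method: record the selection for one row group.
    pub fn select(mut self, row_group: usize, selectors: Vec<RowSelector>) -> Self {
        self.per_row_group.insert(row_group, selectors);
        self
    }

    /// Half-open `(start, end)` row ranges selected within a row group.
    /// Returns an empty list for a row group with no recorded selection.
    pub fn selected_ranges(&self, row_group: usize) -> Vec<(usize, usize)> {
        let mut ranges = Vec::new();
        let mut offset = 0;
        if let Some(selectors) = self.per_row_group.get(&row_group) {
            for s in selectors {
                if !s.skip {
                    ranges.push((offset, offset + s.row_count));
                }
                // Both skipped and selected runs advance the row offset.
                offset += s.row_count;
            }
        }
        ranges
    }
}

fn main() {
    let selection = ParquetSelection::new()
        .select(1, vec![RowSelector::skip(100), RowSelector::select(150)])
        .select(
            2,
            vec![
                RowSelector::skip(50),
                RowSelector::select(50),
                RowSelector::skip(100),
                RowSelector::select(100),
            ],
        );

    println!("{:?}", selection.selected_ranges(1)); // [(100, 250)]
    println!("{:?}", selection.selected_ranges(2)); // [(50, 100), (200, 300)]
}
```

Keying the selections by row group index means a reader could look up the selection for each row group it opens and treat a missing entry however the final design decides (this sketch returns no ranges; reading the whole row group would be an equally plausible choice).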
