sunchao commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551680435

##########
File path: rust/parquet/src/file/serialized_reader.rs
##########
@@ -137,6 +137,22 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
             metadata,
         })
     }
+
+    pub fn filter_row_groups(

Review comment:

What I was thinking is that we could add another constructor for `SerializedFileReader` which takes custom metadata:

```rust
pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> Result<Self> {
    Ok(Self {
        chunk_reader: Arc::new(chunk_reader),
        metadata,
    })
}
```

and move the metadata-filtering part to DataFusion, or into a util function in `footer.rs`.

In the long term, though, I think we should do something similar to [what parquet-mr does](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L656): introduce a `ParquetReadOptions`-like struct that lets the user specify various configs, properties, filters, etc. when reading a Parquet file. The struct is also extensible, so it can accommodate future features such as filtering with column indexes or bloom filters. The constructor could then become:

```rust
pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result<Self> {
    Ok(Self {
        chunk_reader: Arc::new(chunk_reader),
        options,
    })
}
```
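To make that concrete, here is a minimal, self-contained sketch of what such an options struct and constructor could look like. Everything here is illustrative: `ParquetReadOptions`, `row_group_predicate`, and `new_with_options` are hypothetical names, not existing API, and the stand-in `SerializedFileReader` mirrors the real struct's two fields only so the example compiles on its own:

```rust
// A minimal sketch of the ParquetReadOptions idea, written against the
// rust/parquet API as of this PR. ParquetReadOptions, row_group_predicate,
// and new_with_options are hypothetical names.
use std::sync::Arc;

use parquet::errors::Result;
use parquet::file::footer;
use parquet::file::metadata::{ParquetMetaData, RowGroupMetaData};
use parquet::file::reader::ChunkReader;

/// Hypothetical bag of knobs for opening a Parquet file, modeled after
/// parquet-mr's ParquetReadOptions. Future features (column-index
/// filtering, bloom filters, ...) become new fields here rather than
/// new constructor overloads.
#[derive(Default)]
pub struct ParquetReadOptions {
    /// Keep only the row groups for which this predicate returns true;
    /// `None` keeps all row groups.
    pub row_group_predicate: Option<Box<dyn Fn(&RowGroupMetaData, usize) -> bool>>,
}

/// Stand-in for the real `SerializedFileReader` so this sketch compiles
/// on its own; the real struct holds the same two fields.
pub struct SerializedFileReader<R: ChunkReader> {
    chunk_reader: Arc<R>,
    metadata: ParquetMetaData,
}

impl<R: 'static + ChunkReader> SerializedFileReader<R> {
    /// Hypothetical options-taking constructor.
    pub fn new_with_options(chunk_reader: R, options: ParquetReadOptions) -> Result<Self> {
        // Parse the footer, as the existing `new` constructor does.
        let mut metadata = footer::parse_metadata(&chunk_reader)?;

        // Apply the row-group predicate, if any, before handing the
        // reader back to the caller.
        if let Some(predicate) = options.row_group_predicate {
            let mut kept = Vec::new();
            for (i, rg) in metadata.row_groups().iter().enumerate() {
                if predicate(rg, i) {
                    kept.push(rg.clone());
                }
            }
            metadata = ParquetMetaData::new(metadata.file_metadata().clone(), kept);
        }

        Ok(Self {
            chunk_reader: Arc::new(chunk_reader),
            metadata,
        })
    }
}
```

A call site could then express this PR's row-group filtering as, e.g., `SerializedFileReader::new_with_options(file, ParquetReadOptions { row_group_predicate: Some(Box::new(|rg, _| rg.num_rows() > 0)), ..Default::default() })`, and the constructor signature stays stable as new kinds of filters are added to the struct.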
########## File path: rust/parquet/src/file/serialized_reader.rs ########## @@ -137,6 +137,22 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> { metadata, }) } + + pub fn filter_row_groups( Review comment: What I was thinking is that, we can have another constructor for `SerializedFileReader` which takes a custom metadata: ```rust pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> Result<Self> { Ok(Self { chunk_reader: Arc::new(chunk_reader), metadata: metadata, }) } ``` and we move the metadata filtering part to data fusion, or a util function in `footer.rs`. In the long term though, I think we should do something similar to [parquet-mr is doing](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L656), that is, having a `ParquetReadOptions`-like struct which allows user to specify various configs, properties, filters, etc when reading a parquet file. The struct is extendable as well to accommodate new features in future such as filtering with column indexes or bloom filters. The constructor can become like this: ```rust pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result<Self> { Ok(Self { chunk_reader: Arc::new(chunk_reader), options: options, }) } ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org