sunchao commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551680435
##########
File path: rust/parquet/src/file/serialized_reader.rs
##########
@@ -137,6 +137,22 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
metadata,
})
}
+
+ pub fn filter_row_groups(
Review comment:
What I was thinking is that we could have another constructor for
`SerializedFileReader` which takes custom metadata:
```rust
pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> Result<Self> {
    Ok(Self {
        chunk_reader: Arc::new(chunk_reader),
        metadata,
    })
}
```
and move the metadata-filtering part into DataFusion, or into a util function
in `footer.rs`.
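For illustration, such a util function in `footer.rs` might look roughly like
the following; `filter_metadata` is just a placeholder name, and this assumes
the metadata types implement `Clone`:
```rust
use crate::file::metadata::{ParquetMetaData, RowGroupMetaData};

/// Hypothetical helper: rebuilds `ParquetMetaData`, keeping only the row
/// groups that satisfy `predicate` (which also receives the row group index).
pub fn filter_metadata<F>(metadata: &ParquetMetaData, predicate: F) -> ParquetMetaData
where
    F: Fn(&RowGroupMetaData, usize) -> bool,
{
    let mut row_groups = Vec::new();
    for (i, rg) in metadata.row_groups().iter().enumerate() {
        if predicate(rg, i) {
            row_groups.push(rg.clone());
        }
    }
    ParquetMetaData::new(metadata.file_metadata().clone(), row_groups)
}
```
DataFusion (or any other caller) could then pair this with `new_with_metadata`
above to read only the row groups it cares about.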
In the long term though, I think we should do something similar to what
[parquet-mr
does](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L656),
that is, have a `ParquetReadOptions`-like struct which allows users to specify
various configs, properties, filters, etc. when reading a parquet file. The
struct would also be extensible, to accommodate future features such as
filtering with column indexes or bloom filters. The constructor could then
become:
```rust
pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result<Self> {
    Ok(Self {
        chunk_reader: Arc::new(chunk_reader),
        options,
    })
}
```
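To make that concrete, the options struct could look something like the sketch
below; every field name here is hypothetical, with parquet-mr's
`ParquetReadOptions` serving only as a reference point:
```rust
use crate::file::metadata::RowGroupMetaData;

/// Hypothetical read-options struct, loosely modeled after parquet-mr's
/// `ParquetReadOptions`; all fields are illustrative placeholders.
#[derive(Default)]
pub struct ParquetReadOptions {
    /// Predicates applied to each row group's metadata and index; row groups
    /// failing any predicate are skipped when reading.
    pub row_group_predicates: Vec<Box<dyn Fn(&RowGroupMetaData, usize) -> bool>>,
    /// Whether to use the column index (if present) to prune pages.
    pub use_column_index: bool,
    /// Whether to consult bloom filters (if present) when pruning row groups.
    pub use_bloom_filters: bool,
}
```
New knobs can then be added as fields without breaking the constructor
signature.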