sunchao commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551680435



##########
File path: rust/parquet/src/file/serialized_reader.rs
##########
@@ -137,6 +137,22 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
             metadata,
         })
     }
+
+    pub fn filter_row_groups(

Review comment:
   What I was thinking is that we could add another constructor for `SerializedFileReader` which takes custom metadata:
   ```rust
   /// Constructs a reader from already-parsed (e.g. pre-filtered) metadata,
   /// skipping the footer-parsing step done by `new`.
   pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> Result<Self> {
       Ok(Self {
           chunk_reader: Arc::new(chunk_reader),
           metadata,
       })
   }
   ```
   and we move the metadata filtering part into DataFusion, or into a util function in `footer.rs` (see the sketch after this paragraph).
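   For illustration, a minimal sketch of what such a util function could look like. The name `filter_row_groups`, the predicate signature, and its placement in `footer.rs` are all assumptions on my part, and it presumes the metadata types implement `Clone`:
   ```rust
   use crate::file::metadata::{ParquetMetaData, RowGroupMetaData};

   /// Hypothetical util (name and signature are assumptions): rebuilds
   /// `ParquetMetaData`, keeping only the row groups that satisfy `predicate`.
   /// Assumes `FileMetaData` and `RowGroupMetaData` are `Clone`.
   pub fn filter_row_groups<F>(metadata: &ParquetMetaData, predicate: F) -> ParquetMetaData
   where
       F: Fn(&RowGroupMetaData) -> bool,
   {
       let row_groups = metadata
           .row_groups()
           .iter()
           .filter(|rg| predicate(rg))
           .cloned()
           .collect();
       ParquetMetaData::new(metadata.file_metadata().clone(), row_groups)
   }
   ```
   DataFusion (or any other caller) could then apply its own pruning logic and hand the filtered metadata to `new_with_metadata`.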
   
   In the long term, though, I think we should do something similar to what [parquet-mr does](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L656): have a `ParquetReadOptions`-like struct which lets users specify various configs, properties, filters, etc. when reading a Parquet file. The struct is also extensible, so it can accommodate future features such as filtering with column indexes or bloom filters. The constructor could then become:
   ```rust
   pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result<Self> {
       Ok(Self {
           chunk_reader: Arc::new(chunk_reader),
           options,
       })
   }
   ```
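   To make the idea concrete, here is a rough sketch of what such a struct might hold, loosely modeled on parquet-mr; every name and field below is an illustrative assumption, not a concrete API proposal:
   ```rust
   use crate::file::metadata::RowGroupMetaData;

   /// Hypothetical options struct, loosely modeled on parquet-mr's
   /// `ParquetReadOptions`; all fields here are illustrative assumptions.
   #[derive(Default)]
   pub struct ParquetReadOptions {
       /// Optional predicate deciding which row groups to read.
       pub row_group_filter: Option<Box<dyn Fn(&RowGroupMetaData) -> bool>>,
       /// Possible future knobs: page-level filtering with column indexes,
       /// and predicate evaluation with bloom filters.
       pub use_column_index: bool,
       pub use_bloom_filter: bool,
   }

   impl ParquetReadOptions {
       /// Builder-style setter for the row group filter.
       pub fn with_row_group_filter<F>(mut self, f: F) -> Self
       where
           F: Fn(&RowGroupMetaData) -> bool + 'static,
       {
           self.row_group_filter = Some(Box::new(f));
           self
       }
   }
   ```
   A caller could then write something like `SerializedFileReader::new(file, ParquetReadOptions::default().with_row_group_filter(|rg| rg.num_rows() > 0))`, and new options can be added later without breaking the constructor signature.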



