yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551583777



##########
File path: rust/parquet/src/file/serialized_reader.rs
##########
@@ -137,6 +137,22 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
             metadata,
         })
     }
+
+    pub fn filter_row_groups(

Review comment:
       Good point about documentation - will add some shortly. 
   
   As long as row group metadata is filtered immediately after creating a 
SerializedFileReader, this approach will work.
   
   That's the simplest way I could think of to allow filtering of row groups 
using statistics metadata. I'm not sure how this could be done within DataFusion 
itself, because it reads data in batches (of configurable size) which could 
potentially span multiple row groups. It could be done, but it would probably move 
a lot of complexity into DataFusion that today is nicely abstracted away in the 
parquet library. It would also expose a lot more of the internals of the 
parquet file format to the outside, as the user would have to be aware of row 
groups rather than just requesting batches of data.
   Maybe I misunderstand what you are suggesting?
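
   To illustrate the idea, here is a minimal sketch of the approach, not the 
code in this PR. `RowGroupMeta`, `FileReader` and `batch_sizes` are hypothetical 
stand-ins for the parquet types, and the predicate signature (metadata plus row 
group index) is an assumption. The reader drops row group metadata right after 
construction, and the consumer keeps requesting batches without knowing about 
row groups.

```rust
/// Hypothetical stand-in for per-row-group metadata (not the parquet crate's type).
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupMeta {
    pub num_rows: usize,
    /// Min/max statistics of the column being filtered on.
    pub min: i64,
    pub max: i64,
}

/// Hypothetical stand-in for a file reader that owns its row group metadata.
pub struct FileReader {
    row_groups: Vec<RowGroupMeta>,
}

impl FileReader {
    pub fn new(row_groups: Vec<RowGroupMeta>) -> Self {
        Self { row_groups }
    }

    /// Keep only the row groups for which `predicate` returns true.
    /// The second argument is the row group's index before filtering.
    /// Assumed to be called right after construction, before any data is read.
    pub fn filter_row_groups(&mut self, predicate: &dyn Fn(&RowGroupMeta, usize) -> bool) {
        let mut index = 0;
        // `retain` visits elements in order, so `index` tracks the original position.
        self.row_groups.retain(|rg| {
            let keep = predicate(rg, index);
            index += 1;
            keep
        });
    }

    pub fn num_row_groups(&self) -> usize {
        self.row_groups.len()
    }

    /// Sizes of the batches a consumer would receive for a given batch size.
    /// A single batch may span several of the remaining row groups.
    pub fn batch_sizes(&self, batch_size: usize) -> Vec<usize> {
        if batch_size == 0 {
            return Vec::new();
        }
        let mut remaining: usize = self.row_groups.iter().map(|rg| rg.num_rows).sum();
        let mut sizes = Vec::new();
        while remaining > 0 {
            let n = remaining.min(batch_size);
            sizes.push(n);
            remaining -= n;
        }
        sizes
    }
}
```

   For a filter such as `value > 12`, the predicate would be 
`&|rg: &RowGroupMeta, _i: usize| rg.max > 12`: any row group whose max statistic 
is at most 12 cannot contain a match and is skipped entirely, while batching 
over the remaining row groups is unchanged.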




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

