alamb commented on a change in pull request #1389:
URL: https://github.com/apache/arrow-rs/pull/1389#discussion_r818965246



##########
File path: parquet/src/file/serialized_reader.rs
##########
@@ -138,25 +138,51 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
         })
     }
 
-    /// Filters row group metadata to only those row groups,
-    /// for which the predicate function returns true
+    /// Filter row groups by metadata that match the predicate criteria and row group's midpoint
+    /// are within the `[start, end)` range (if the range is provided).
     pub fn filter_row_groups(
         &mut self,
         predicate: &dyn Fn(&RowGroupMetaData, usize) -> bool,
+        range: Option<(i64, i64)>,

Review comment:
       I am a little confused about this API. 
   
   The units of `range` appear to be `bytes`, and the range is then compared with 
the compressed size in bytes.
   
   I assume this is so that the parquet file can be divided into evenly sized 
pieces, which makes sense given the comments on 
https://github.com/blaze-init/blaze-rs/issues/36
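
To illustrate why a midpoint rule divides a file cleanly: each row group's midpoint falls into exactly one `[start, end)` byte range, so every row group ends up in exactly one piece. A minimal sketch, assuming a row group is described only by a starting byte offset and a compressed size (the `RowGroup` struct and the `midpoint` and `groups_in_range` helpers are hypothetical stand-ins, not the parquet crate's types or the PR's code):

```rust
/// Hypothetical stand-in for row group metadata; not the parquet crate's type.
struct RowGroup {
    start_offset: i64,    // byte offset where the row group begins
    compressed_size: i64, // compressed size in bytes
}

/// Midpoint of the row group's byte span.
fn midpoint(rg: &RowGroup) -> i64 {
    rg.start_offset + rg.compressed_size / 2
}

/// Indices of the row groups whose midpoint lies in `[start, end)`.
fn groups_in_range(groups: &[RowGroup], start: i64, end: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let mid = midpoint(rg);
            mid >= start && mid < end
        })
        .map(|(i, _)| i)
        .collect()
}
```

With three 100-byte row groups starting at offsets 0, 100 and 200, the midpoints are 50, 150 and 250, so the range `[0, 150)` selects index 0 and the range `[150, 300)` selects indices 1 and 2. No row group is read twice or dropped.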
   
   However, I think this API may be confusing to most other users. What if we 
changed the signature of `filter_row_groups` to take a `FnMut`, that is, from
   
   ```rust
           predicate: &dyn Fn(&RowGroupMetaData, usize) -> bool,
   ```
   
   to 
   
   ```rust
           predicate: &mut dyn FnMut(&RowGroupMetaData, usize) -> bool,
   ```
   
   I think if that was done, it would be possible to implement the logic you 
have below to only pick the groups in a range with a closure that had state. 
Something like (untested):
   
   ```rust
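   let mut stop = false;
   
   reader.filter_row_groups(&mut |row_group_metadata, i| {
       // Once a midpoint falls past the end of the range, skip all later row groups.
       if stop {
           return false;
       }
   
       let mid = get_midpoint_offset(row_group_metadata);
       if mid >= range.1 {
           stop = true;
           return false;
       }
   
       // Keep the row group only if it is inside the range and matches the predicate.
       mid >= range.0 && predicate(row_group_metadata, i)
   });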
   ```
   ?
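
For reference, a self-contained sketch of the stateful-closure idea that compiles on its own. Everything here is an assumption made for illustration: `RowGroupMeta` is a hypothetical stand-in that stores a precomputed midpoint, `filter_row_groups` is a simplified free function rather than the real method, and `filter_in_range` is a hypothetical wrapper. The predicate parameter is written as `&mut dyn FnMut` because calling an `FnMut` requires mutable access to it.

```rust
/// Hypothetical stand-in for `RowGroupMetaData`; only the midpoint is modeled.
struct RowGroupMeta {
    midpoint: i64,
}

/// Simplified version of the proposed API: returns the indices of the row
/// groups for which the predicate returns true.
fn filter_row_groups(
    groups: &[RowGroupMeta],
    predicate: &mut dyn FnMut(&RowGroupMeta, usize) -> bool,
) -> Vec<usize> {
    let mut kept = Vec::new();
    for (i, rg) in groups.iter().enumerate() {
        if predicate(rg, i) {
            kept.push(i);
        }
    }
    kept
}

/// Range filtering expressed as a stateful closure layered over a user predicate.
/// The early stop assumes row groups are visited in increasing offset order.
fn filter_in_range(
    groups: &[RowGroupMeta],
    range: (i64, i64),
    user_predicate: &dyn Fn(&RowGroupMeta, usize) -> bool,
) -> Vec<usize> {
    let mut stop = false;
    filter_row_groups(groups, &mut |rg, i| {
        if stop {
            return false;
        }
        if rg.midpoint >= range.1 {
            // Past the end of the range: reject this and every later row group.
            stop = true;
            return false;
        }
        rg.midpoint >= range.0 && user_predicate(rg, i)
    })
}
```

With midpoints 50, 150, 250 and 350 and the range `(100, 300)`, an always-true user predicate keeps indices 1 and 2. The range logic lives entirely in the caller's closure, so the library API does not need to know about byte ranges.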



