alamb commented on a change in pull request #1389:
URL: https://github.com/apache/arrow-rs/pull/1389#discussion_r818965246
##########
File path: parquet/src/file/serialized_reader.rs
##########
@@ -138,25 +138,51 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
})
}
- /// Filters row group metadata to only those row groups,
- /// for which the predicate function returns true
+ /// Filter row groups by metadata that match the predicate criteria and row group's midpoint
+ /// are within the `[start, end)` range (if the range is provided).
pub fn filter_row_groups(
&mut self,
predicate: &dyn Fn(&RowGroupMetaData, usize) -> bool,
+ range: Option<(i64, i64)>,
Review comment:
I am a little confused about this API.
The units of `range` appear to be bytes, and the range is then compared with
the compressed size in bytes.
I assume this is so that the parquet file can be divided into evenly sized
pieces, which makes sense given the comments on
https://github.com/blaze-init/blaze-rs/issues/36
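To check my understanding of the midpoint rule, here is a minimal sketch: a row group belongs to a split when the midpoint of its byte span falls in the split's `[start, end)` range, so contiguous splits assign each row group to exactly one split. `RowGroupSpan`, its fields, and `row_groups_in_range` are hypothetical stand-ins, not the parquet crate's API, and computing the midpoint as start offset plus half the compressed size is my assumption.

```rust
/// Hypothetical stand-in for the byte span a row group occupies in the file.
struct RowGroupSpan {
    start_offset: i64,
    compressed_size: i64,
}

impl RowGroupSpan {
    /// Midpoint of the row group's byte span (assumed definition).
    fn midpoint(&self) -> i64 {
        self.start_offset + self.compressed_size / 2
    }
}

/// Returns the indices of the row groups whose midpoint lies in `[start, end)`.
fn row_groups_in_range(groups: &[RowGroupSpan], start: i64, end: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| {
            let mid = g.midpoint();
            mid >= start && mid < end
        })
        .map(|(i, _)| i)
        .collect()
}
```

Because the range is half-open, a midpoint that sits exactly on a split boundary goes to the later split only.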
However, I think this API may be confusing to most other users. What if we
changed the `predicate` argument of `filter_row_groups` to take a `FnMut`, that is, from
```rust
predicate: &dyn Fn(&RowGroupMetaData, usize) -> bool,
```
to
```rust
predicate: &mut dyn FnMut(&RowGroupMetaData, usize) -> bool,
```
(It has to be `&mut dyn`, since a `FnMut` can only be called through a mutable
reference.) I think if that were done, it would be possible to implement the
logic you have below, picking only the row groups in a range, with a closure
that holds state.
Something like (untested):
```rust
let mut stop = false;
reader.filter_row_groups(&mut |row_group_metadata, i| {
    // Assumes row groups are visited in increasing offset order, so nothing
    // after the first midpoint past the end of the range can match.
    if stop {
        return false;
    }
    let mid = get_midpoint_offset(row_group_metadata);
    if mid >= range.1 {
        stop = true;
        return false;
    }
    mid >= range.0 && predicate(row_group_metadata, i)
});
```
?
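To make the idea concrete, here is a self-contained sketch of the same pattern using stand-in types. `RowGroup`, `Reader`, and `filter_by_range` are hypothetical and only mimic the shape of the real API; the sketch also assumes row groups are ordered by increasing midpoint.

```rust
/// Hypothetical stand-in for `RowGroupMetaData`; only the midpoint matters here.
struct RowGroup {
    midpoint: i64,
}

/// Hypothetical stand-in for the file reader.
struct Reader {
    row_groups: Vec<RowGroup>,
}

impl Reader {
    /// Keeps only the row groups for which `predicate` returns true.
    /// The second argument passed to `predicate` is the original index.
    fn filter_row_groups(&mut self, predicate: &mut dyn FnMut(&RowGroup, usize) -> bool) {
        let mut i = 0;
        self.row_groups.retain(|rg| {
            let keep = predicate(rg, i);
            i += 1;
            keep
        });
    }
}

/// Keeps the row groups whose midpoint is in `[range.0, range.1)` and that
/// also pass `predicate`. Assumes row groups are ordered by increasing midpoint.
fn filter_by_range(
    reader: &mut Reader,
    range: (i64, i64),
    predicate: &dyn Fn(&RowGroup, usize) -> bool,
) {
    // State captured by the closure: set once we are past the end of the range.
    let mut stop = false;
    reader.filter_row_groups(&mut |rg, i| {
        if stop {
            return false;
        }
        if rg.midpoint >= range.1 {
            stop = true;
            return false;
        }
        rg.midpoint >= range.0 && predicate(rg, i)
    });
}
```

With this shape the range logic lives entirely in the caller, and `filter_row_groups` itself does not need to know about byte offsets.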