alamb commented on code in PR #10158:
URL: https://github.com/apache/arrow-rs/pull/10158#discussion_r3454695210
##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -538,6 +538,31 @@ impl ParquetPushDecoder {
self.state.row_groups_remaining()
}
+ /// Returns the file-level row-group index that the next call to
Review Comment:
I think `file-level` is somewhat redundant here -- the PushDecoder is always
in teh context of a single file
```suggestion
/// Returns the row-group index that the next call to
```
##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -538,6 +538,31 @@ impl ParquetPushDecoder {
self.state.row_groups_remaining()
}
+ /// Returns the file-level row-group index that the next call to
+ /// [`Self::try_next_reader`] will yield a reader for, after applying
+ /// any internal skipping (row selection emptiness, exhausted budget,
+ /// finished state). Returns `Ok(None)` when:
Review Comment:
```suggestion
/// finished state).
///
/// Returns `Ok(None)` when:
```
##########
parquet/src/arrow/push_decoder/remaining.rs:
##########
@@ -93,6 +93,63 @@ impl RowGroupFrontier {
self.budget = budget;
}
+ /// Peek at the next row-group index `next_readable_row_group` would
+ /// hand out, without mutating any state. Returns `None` if every
+ /// remaining row group would be skipped under the current
+ /// selection/budget, or if the queue is empty.
+ ///
+ /// Mirrors the structure of `next_readable_row_group` but only walks
+ /// borrowed state — used by
[`super::ParquetPushDecoder::peek_next_row_group`]
+ /// to let adaptive callers (e.g. dynamic row-group pruners or per-RG
+ /// `RowFilter` toggles) keep their per-RG state in lock-step with
+ /// the reader the decoder is about to emit.
+ fn peek_next_row_group(&self) -> Result<Option<usize>, ParquetError> {
+ // Short-circuit: budget exhausted or selection drained ⇒ same
+ // outcome as `next_readable_row_group`'s early return.
+ if self.budget.is_exhausted()
+ || self
+ .selection
+ .as_ref()
+ .is_some_and(|selection| selection.row_count() == 0)
+ {
+ return Ok(None);
+ }
+
+ // We may have to walk past row groups whose split selection is
+ // empty. Cloning the selection lets us run the same `split_off`
+ // logic without disturbing the real one.
+ let mut selection = self.selection.clone();
Review Comment:
😬 it would be nice to avoid cloning the selectin
##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -538,6 +538,31 @@ impl ParquetPushDecoder {
self.state.row_groups_remaining()
}
+ /// Returns the file-level row-group index that the next call to
+ /// [`Self::try_next_reader`] will yield a reader for, after applying
+ /// any internal skipping (row selection emptiness, exhausted budget,
+ /// finished state). Returns `Ok(None)` when:
+ /// - the decoder has no more row groups to read,
+ /// - the decoder is currently inside a row group (consumers should
+ /// call [`Self::is_at_row_group_boundary`] first), or
+ /// - every remaining row group would be skipped.
+ ///
+ /// Returns `Err` when reading row-group metadata fails (e.g.
+ /// `usize` overflow on 32-bit targets), matching the error surface
+ /// of `try_next_reader` so peek and read paths report errors
+ /// consistently.
+ ///
+ /// This is a read-only peek: it does not mutate decoder state. It is
+ /// useful for adaptive callers that maintain per-row-group state in
+ /// lock-step with the decoder (e.g. dynamic row-group pruners or
+ /// per-RG `RowFilter` toggles): without this peek the caller has no
+ /// way to know which row group the next reader actually corresponds
+ /// to, because [`Self::try_next_reader`] may silently advance past
+ /// row groups whose row selection is empty.
Review Comment:
Here is a suggestion to make this more concise for your consideration
```suggestion
/// This method not mutate decoder state. It is
/// useful for callers that maintain per-row-group state in
/// lock-step with the decoder (e.g. dynamic row-group pruners)
/// to determine which row group the next reader corresponds
/// to as [`Self::try_next_reader`] may silently advance past
/// row groups based on filtering and other criteria
```
##########
parquet/src/arrow/push_decoder/remaining.rs:
##########
@@ -93,6 +93,63 @@ impl RowGroupFrontier {
self.budget = budget;
}
+ /// Peek at the next row-group index `next_readable_row_group` would
+ /// hand out, without mutating any state. Returns `None` if every
+ /// remaining row group would be skipped under the current
+ /// selection/budget, or if the queue is empty.
+ ///
+ /// Mirrors the structure of `next_readable_row_group` but only walks
Review Comment:
is it possible to avoid the duplication with `next_readable_row_group`? I
worry the two could drift out of sync
##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -538,6 +538,31 @@ impl ParquetPushDecoder {
self.state.row_groups_remaining()
}
+ /// Returns the file-level row-group index that the next call to
+ /// [`Self::try_next_reader`] will yield a reader for, after applying
+ /// any internal skipping (row selection emptiness, exhausted budget,
+ /// finished state). Returns `Ok(None)` when:
+ /// - the decoder has no more row groups to read,
+ /// - the decoder is currently inside a row group (consumers should
+ /// call [`Self::is_at_row_group_boundary`] first), or
+ /// - every remaining row group would be skipped.
+ ///
+ /// Returns `Err` when reading row-group metadata fails (e.g.
+ /// `usize` overflow on 32-bit targets), matching the error surface
+ /// of `try_next_reader` so peek and read paths report errors
+ /// consistently.
Review Comment:
This seems overly detailed (it probably isn't necessary to say that reading
will fail when there is an overflo
```suggestion
```
##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -840,6 +865,19 @@ impl ParquetDecoderState {
}
}
+ fn peek_next_row_group(&self) -> Result<Option<usize>, ParquetError> {
Review Comment:
maybe you could add a comment with a link to
ParquetPushDecoder::peek_next_row_group?
##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -538,6 +538,31 @@ impl ParquetPushDecoder {
self.state.row_groups_remaining()
}
+ /// Returns the file-level row-group index that the next call to
+ /// [`Self::try_next_reader`] will yield a reader for, after applying
+ /// any internal skipping (row selection emptiness, exhausted budget,
+ /// finished state). Returns `Ok(None)` when:
+ /// - the decoder has no more row groups to read,
+ /// - the decoder is currently inside a row group (consumers should
+ /// call [`Self::is_at_row_group_boundary`] first), or
+ /// - every remaining row group would be skipped.
+ ///
+ /// Returns `Err` when reading row-group metadata fails (e.g.
+ /// `usize` overflow on 32-bit targets), matching the error surface
+ /// of `try_next_reader` so peek and read paths report errors
+ /// consistently.
+ ///
+ /// This is a read-only peek: it does not mutate decoder state. It is
+ /// useful for adaptive callers that maintain per-row-group state in
+ /// lock-step with the decoder (e.g. dynamic row-group pruners or
+ /// per-RG `RowFilter` toggles): without this peek the caller has no
+ /// way to know which row group the next reader actually corresponds
+ /// to, because [`Self::try_next_reader`] may silently advance past
+ /// row groups whose row selection is empty.
+ pub fn peek_next_row_group(&self) -> Result<Option<usize>, ParquetError> {
+ self.state.peek_next_row_group()
+ }
+
Review Comment:
Are you suggesting it actually return more than one row group? Something
like
```rust
pub fn peek_next_row_groups(&self) -> Result<Option<&[usize]>,
ParquetError> {
```
or just rename it?
##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -840,6 +865,19 @@ impl ParquetDecoderState {
}
}
+ fn peek_next_row_group(&self) -> Result<Option<usize>, ParquetError> {
+ match self {
+ ParquetDecoderState::ReadingRowGroup {
+ remaining_row_groups,
+ } => remaining_row_groups.peek_next_row_group(),
+ // We only expose a meaningful answer at row-group boundaries.
+ // Mid-row-group there is no "next" — the active reader is
+ // tied to the current row group.
+ ParquetDecoderState::DecodingRowGroup { .. } => Ok(None),
Review Comment:
I think there is a well defined next row group even in the middle of reading
a row group. Why not return the remaining row groups here? Something like
```rust
ParquetDecoderState::DecodingRowGroup {
remaining_row_groups ,
..
} => remaining_row_groups.peek_next_row_group(),
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]