alamb commented on code in PR #10158:
URL: https://github.com/apache/arrow-rs/pull/10158#discussion_r3481220941


##########
parquet/src/arrow/push_decoder/remaining.rs:
##########
@@ -37,6 +37,58 @@ enum QueuedRowGroupDecision {
     Skip { remaining_budget: RowBudget },
 }
 
+/// Borrowed cursor over a [`RowSelection`] that counts selected rows in
+/// each row group's slice without mutating the selection.

Review Comment:
   this is a neat idea, but the more I see the more I think it may be overly 
fancy / premature optimization



##########
parquet/src/arrow/push_decoder/remaining.rs:
##########
@@ -93,6 +145,62 @@ impl RowGroupFrontier {
         self.budget = budget;
     }
 
+    /// Peek at the next row-group index `next_readable_row_group` would
+    /// hand out, without mutating any state. Returns `None` if every
+    /// remaining row group would be skipped under the current
+    /// selection/budget, or if the queue is empty.
+    ///
+    /// Walks the queued frontier via [`PeekSelectionCursor`] so the
+    /// real `RowSelection` is not cloned. The Read/Skip rule inlined
+    /// below is intentionally kept in lock-step with
+    /// [`Self::plan_selected_row_group`]; both touch a small enough
+    /// set of decisions that a shared helper would obscure more than
+    /// it saves. The `peek_matches_next_readable_first_hit` test
+    /// asserts the lock-step on the head element across a range of
+    /// inputs.
+    fn peek_next_row_group(&self) -> Result<Option<usize>, ParquetError> {
+        // Short-circuit: budget exhausted or selection drained ⇒ same
+        // outcome as `next_readable_row_group`'s early return.
+        if self.budget.is_exhausted()

Review Comment:
   I worry that this is very similar to next_readeable_row_group and that the 
implementations can drift out of sync over time
   
   
https://github.com/apache/arrow-rs/blob/fe0b993a14ddb963dcf6698ec4d90d9ec1911483/parquet/src/arrow/push_decoder/remaining.rs#L126-L125



##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -538,6 +538,31 @@ impl ParquetPushDecoder {
         self.state.row_groups_remaining()
     }
 
+    /// Returns the file-level row-group index that the next call to
+    /// [`Self::try_next_reader`] will yield a reader for, after applying
+    /// any internal skipping (row selection emptiness, exhausted budget,
+    /// finished state). Returns `Ok(None)` when:
+    /// - the decoder has no more row groups to read,
+    /// - the decoder is currently inside a row group (consumers should
+    ///   call [`Self::is_at_row_group_boundary`] first), or
+    /// - every remaining row group would be skipped.
+    ///
+    /// Returns `Err` when reading row-group metadata fails (e.g.
+    /// `usize` overflow on 32-bit targets), matching the error surface
+    /// of `try_next_reader` so peek and read paths report errors
+    /// consistently.
+    ///
+    /// This is a read-only peek: it does not mutate decoder state. It is
+    /// useful for adaptive callers that maintain per-row-group state in
+    /// lock-step with the decoder (e.g. dynamic row-group pruners or
+    /// per-RG `RowFilter` toggles): without this peek the caller has no
+    /// way to know which row group the next reader actually corresponds
+    /// to, because [`Self::try_next_reader`] may silently advance past
+    /// row groups whose row selection is empty.
+    pub fn peek_next_row_group(&self) -> Result<Option<usize>, ParquetError> {
+        self.state.peek_next_row_group()
+    }
+

Review Comment:
   Let's file a follow on ticket to consider adding additional APIs that will 
expose more information. For now, the next row group is a step forward



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to