Re: [PR] Allow swapping projections / row filters at row group boundaries in Parquet reader [arrow-rs]

via GitHub Fri, 15 May 2026 07:10:33 -0700


alamb commented on code in PR #9968:
URL: https://github.com/apache/arrow-rs/pull/9968#discussion_r3248692408



##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -377,6 +379,100 @@ impl ParquetPushDecoder {
     pub fn clear_all_ranges(&mut self) {
         self.state.clear_all_ranges();
     }
+
+    /// True iff the decoder is at a row-group boundary and a
+    /// [`Self::swap_strategy`] call would succeed.
+    ///
+    /// A boundary is "between row groups": the previous row group's
+    /// [`ParquetRecordBatchReader`] has been fully extracted (via
+    /// [`Self::try_next_reader`]) or fully drained (via [`Self::try_decode`]),
+    /// and the next row group has not yet been planned. While
+    /// [`Self::try_decode`] is iterating an active row group's reader this
+    /// returns `false`; with [`Self::try_next_reader`] there is a clean
+    /// window between two consecutive returns where this is `true`.
+    pub fn can_swap_strategy(&self) -> bool {
+        self.state.can_swap_strategy()
+    }
+
+    /// Number of row groups left to decode after the one currently in flight.
+    /// Useful as a "should I bother computing a new strategy?" signal.
+    pub fn row_groups_remaining(&self) -> usize {
+        self.state.row_groups_remaining()
+    }
+
+    /// Replace projection / row filter / row selection policy for subsequent
+    /// row groups.
+    ///
+    /// Returns `Err(ParquetError::General)` when called outside a row-group
+    /// boundary; check [`Self::can_swap_strategy`] first.
+    ///
+    /// The decoder's internal `PushBuffers` are preserved across the swap.
+    /// Bytes that have already been fetched for columns the new strategy
+    /// still needs will not be re-requested. Bytes for columns the new
+    /// strategy no longer needs remain buffered until
+    /// [`Self::clear_all_ranges`] is called or the decoder is dropped.
+    ///
+    /// Limit, offset, batch size, metadata, fields, and predicate-cache size
+    /// are unchanged by a swap.
+    pub fn swap_strategy(&mut self, swap: StrategySwap) -> Result<(), 
ParquetError> {
+        self.state.swap_strategy(swap)
+    }
+}
+
+/// Description of a strategy swap to apply via
+/// [`ParquetPushDecoder::swap_strategy`].
+///
+/// Each field is `Option`-wrapped so callers only override what they intend
+/// to change. Fields left as `None` carry their previous value through the
+/// swap.
+///
+/// [`Self::filter`] is doubly-`Option`-wrapped on purpose: the outer
+/// `Option` is "do you want to change the filter?", the inner is "set or
+/// clear?". `Some(None)` clears any previously-installed filter.
+#[derive(Debug, Default)]
+#[non_exhaustive]

Review Comment:
   I think this idea makes sense (to change the projection/selection/filter) 
but I am thinking about why to make this a new structure, as it makes some 
things akward (like Filter being `Option<Option<...>>`
   
   What do you think about putting direct functions on `ParquetPushDecoder` to 
override these 
   
   ```rust
   let decoder = ParquetPushDecoder:...
   let decoder = decoder.with_projection(new_projection)
   ```
   
   For example ? 
   
   That might make the API simpler and easier to find



##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -377,6 +379,100 @@ impl ParquetPushDecoder {
     pub fn clear_all_ranges(&mut self) {
         self.state.clear_all_ranges();
     }
+
+    /// True iff the decoder is at a row-group boundary and a

Review Comment:
   I think it would help to add a high level overview / example of how to swap 
strategies (like turn off the row filter) at boundaries
   
   Perhaps in 
https://docs.rs/parquet/latest/parquet/arrow/push_decoder/type.ParquetPushDecoderBuilder.html
 we can mention that the strategy can be changed during execution
   
   I think in particular pointing out that using `try_next_reader` is the 
important API where the strategy can be changed if needed



##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -377,6 +379,100 @@ impl ParquetPushDecoder {
     pub fn clear_all_ranges(&mut self) {
         self.state.clear_all_ranges();
     }
+
+    /// True iff the decoder is at a row-group boundary and a
+    /// [`Self::swap_strategy`] call would succeed.
+    ///
+    /// A boundary is "between row groups": the previous row group's
+    /// [`ParquetRecordBatchReader`] has been fully extracted (via
+    /// [`Self::try_next_reader`]) or fully drained (via [`Self::try_decode`]),
+    /// and the next row group has not yet been planned. While
+    /// [`Self::try_decode`] is iterating an active row group's reader this
+    /// returns `false`; with [`Self::try_next_reader`] there is a clean
+    /// window between two consecutive returns where this is `true`.
+    pub fn can_swap_strategy(&self) -> bool {
+        self.state.can_swap_strategy()
+    }
+
+    /// Number of row groups left to decode after the one currently in flight.
+    /// Useful as a "should I bother computing a new strategy?" signal.
+    pub fn row_groups_remaining(&self) -> usize {
+        self.state.row_groups_remaining()
+    }
+
+    /// Replace projection / row filter / row selection policy for subsequent
+    /// row groups.
+    ///
+    /// Returns `Err(ParquetError::General)` when called outside a row-group
+    /// boundary; check [`Self::can_swap_strategy`] first.
+    ///
+    /// The decoder's internal `PushBuffers` are preserved across the swap.
+    /// Bytes that have already been fetched for columns the new strategy
+    /// still needs will not be re-requested. Bytes for columns the new
+    /// strategy no longer needs remain buffered until
+    /// [`Self::clear_all_ranges`] is called or the decoder is dropped.
+    ///
+    /// Limit, offset, batch size, metadata, fields, and predicate-cache size
+    /// are unchanged by a swap.
+    pub fn swap_strategy(&mut self, swap: StrategySwap) -> Result<(), 
ParquetError> {
+        self.state.swap_strategy(swap)
+    }
+}
+
+/// Description of a strategy swap to apply via
+/// [`ParquetPushDecoder::swap_strategy`].
+///
+/// Each field is `Option`-wrapped so callers only override what they intend
+/// to change. Fields left as `None` carry their previous value through the
+/// swap.
+///
+/// [`Self::filter`] is doubly-`Option`-wrapped on purpose: the outer
+/// `Option` is "do you want to change the filter?", the inner is "set or
+/// clear?". `Some(None)` clears any previously-installed filter.
+#[derive(Debug, Default)]
+#[non_exhaustive]

Review Comment:
   Another thought I had on API is that we could maybe add an API to go back to 
the existign `ParuqetPushDecoderBuilder` which has all the fields, etc that can 
be adjusted. 
   
   It might be quite elegant to have something like
   
   ```rust
   let decoder = ParquetPushDecoder:...
   let new_decoder = decoder
     .into_builder()
     .with_projection()
     .build()?
   ```
   
   That would also mean that any additional parameters that got added / wanted 
to be changed during decode would have a natural API wihtout additonal plumbing



##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -377,6 +379,100 @@ impl ParquetPushDecoder {
     pub fn clear_all_ranges(&mut self) {
         self.state.clear_all_ranges();
     }
+
+    /// True iff the decoder is at a row-group boundary and a
+    /// [`Self::swap_strategy`] call would succeed.
+    ///
+    /// A boundary is "between row groups": the previous row group's
+    /// [`ParquetRecordBatchReader`] has been fully extracted (via
+    /// [`Self::try_next_reader`]) or fully drained (via [`Self::try_decode`]),
+    /// and the next row group has not yet been planned. While
+    /// [`Self::try_decode`] is iterating an active row group's reader this
+    /// returns `false`; with [`Self::try_next_reader`] there is a clean
+    /// window between two consecutive returns where this is `true`.
+    pub fn can_swap_strategy(&self) -> bool {
+        self.state.can_swap_strategy()
+    }
+
+    /// Number of row groups left to decode after the one currently in flight.
+    /// Useful as a "should I bother computing a new strategy?" signal.
+    pub fn row_groups_remaining(&self) -> usize {
+        self.state.row_groups_remaining()
+    }
+
+    /// Replace projection / row filter / row selection policy for subsequent
+    /// row groups.
+    ///
+    /// Returns `Err(ParquetError::General)` when called outside a row-group
+    /// boundary; check [`Self::can_swap_strategy`] first.
+    ///
+    /// The decoder's internal `PushBuffers` are preserved across the swap.
+    /// Bytes that have already been fetched for columns the new strategy
+    /// still needs will not be re-requested. Bytes for columns the new
+    /// strategy no longer needs remain buffered until
+    /// [`Self::clear_all_ranges`] is called or the decoder is dropped.
+    ///
+    /// Limit, offset, batch size, metadata, fields, and predicate-cache size
+    /// are unchanged by a swap.
+    pub fn swap_strategy(&mut self, swap: StrategySwap) -> Result<(), 
ParquetError> {
+        self.state.swap_strategy(swap)
+    }
+}
+
+/// Description of a strategy swap to apply via
+/// [`ParquetPushDecoder::swap_strategy`].
+///
+/// Each field is `Option`-wrapped so callers only override what they intend
+/// to change. Fields left as `None` carry their previous value through the
+/// swap.
+///
+/// [`Self::filter`] is doubly-`Option`-wrapped on purpose: the outer
+/// `Option` is "do you want to change the filter?", the inner is "set or
+/// clear?". `Some(None)` clears any previously-installed filter.
+#[derive(Debug, Default)]
+#[non_exhaustive]

Review Comment:
   Looking at it I think it might be pretty straightforward:
   
   
https://github.com/apache/arrow-rs/blob/a3592d6950b2f42784d66056ec1f992318a238d3/parquet/src/arrow/push_decoder/mod.rs#L164-L163
   
   A bunch of boiler plate destructing, but easy



##########
parquet/src/arrow/push_decoder/mod.rs:
##########
@@ -377,6 +379,100 @@ impl ParquetPushDecoder {
     pub fn clear_all_ranges(&mut self) {
         self.state.clear_all_ranges();
     }
+
+    /// True iff the decoder is at a row-group boundary and a

Review Comment:
   You have this in the tests but I fear it will be hidden / not obvious if we 
don't also add it to the docs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Allow swapping projections / row filters at row group boundaries in Parquet reader [arrow-rs]

Reply via email to