Re: [PR] [Parquet] Adaptive Parquet Predicate Pushdown [arrow-rs]

via GitHub Fri, 07 Nov 2025 23:30:06 -0800


hhhizzz commented on code in PR #8733:
URL: https://github.com/apache/arrow-rs/pull/8733#discussion_r2506350668



##########
parquet/src/arrow/arrow_reader/read_plan.rs:
##########
@@ -20,19 +20,70 @@
 
 use crate::arrow::array_reader::ArrayReader;
 use crate::arrow::arrow_reader::{
-    ArrowPredicate, ParquetRecordBatchReader, RowSelection, RowSelector,
+    ArrowPredicate, ParquetRecordBatchReader, RowSelection, RowSelectionCursor,
+    RowSelectionStrategy, RowSelector,
 };
 use crate::errors::{ParquetError, Result};
 use arrow_array::Array;
 use arrow_select::filter::prep_null_mask_filter;
-use std::collections::VecDeque;
+use std::sync::atomic::{AtomicUsize, Ordering};
+
+// The average selector length threshold for choosing between
+// `RowSelectionStrategy::Mask` and `RowSelectionStrategy::Selectors`.
+// If the average selector length is less than this value,
+// `RowSelectionStrategy::Mask` is preferred.
+const AVG_SELECTOR_LEN_MASK_THRESHOLD: usize = 32;
+
+// The logic in `preferred_selection_strategy` depends on the constant
+// `AVG_SELECTOR_LEN_MASK_THRESHOLD`. To allow unit testing of this logic,
+// we use a mutable global variable that can be temporarily changed during 
tests.
+//
+// An `AtomicUsize` is used because the Rust test runner (`cargo test`) runs 
tests
+// in parallel by default. The atomic operations prevent data races between
+// different test threads that might try to modify this value simultaneously.
+//
+// For the production code path, `load(Ordering::Relaxed)` is used. This is the
+// weakest memory ordering and for a simple load on most modern architectures,
+// it compiles down to a regular memory read with negligible performance 
overhead.
+// The more expensive atomic operations with stronger ordering are only used 
in the
+// test-only functions below.
+static AVG_SELECTOR_LEN_MASK_THRESHOLD_OVERRIDE: AtomicUsize =
+    AtomicUsize::new(AVG_SELECTOR_LEN_MASK_THRESHOLD);
+
+#[inline(always)]
+fn avg_selector_len_mask_threshold() -> usize {
+    AVG_SELECTOR_LEN_MASK_THRESHOLD_OVERRIDE.load(Ordering::Relaxed)
+}
+
+/// An RAII guard that restores the previous value of the override when it is 
dropped.
+/// This ensures that any change to the global threshold is temporary and 
scoped to
+/// the test or benchmark where it's used, even in the case of a panic.
+pub struct AvgSelectorLenMaskThresholdGuard {
+    previous: usize,
+}
+
+impl Drop for AvgSelectorLenMaskThresholdGuard {
+    fn drop(&mut self) {
+        AVG_SELECTOR_LEN_MASK_THRESHOLD_OVERRIDE.store(self.previous, 
Ordering::SeqCst);
+    }
+}
+
+/// Override AVG_SELECTOR_LEN_MASK_THRESHOLD (primarily for tests / 
benchmarks).
+///
+/// Returns an [`AvgSelectorLenMaskThresholdGuard`] that restores the previous 
value on drop.
+pub fn set_avg_selector_len_mask_threshold(value: usize) -> 
AvgSelectorLenMaskThresholdGuard {
+    let previous = AVG_SELECTOR_LEN_MASK_THRESHOLD_OVERRIDE.swap(value, 
Ordering::SeqCst);
+    AvgSelectorLenMaskThresholdGuard { previous }
+}
 
 /// A builder for [`ReadPlan`]
 #[derive(Clone, Debug)]
 pub struct ReadPlanBuilder {
     batch_size: usize,
     /// Current to apply, includes all filters
     selection: Option<RowSelection>,
+    /// Strategy to use when materialising the row selection
+    selection_strategy: RowSelectionStrategy,

Review Comment:
   I removed `Auto` from the current design and didn't add it to the Builder 
API because `Mask` is currently unstable, requiring us to fall back to 
`Selection` in certain cases. I believe this unstable setting shouldn't be 
directly exposed to users. Moving forward, we could consider:
   1. Allowing users to choose only `Auto`, `Selection`, or other stable 
strategies, implemented via runtime checks or a new enum. `Mask` would only be 
selected heuristically when `Auto` is chosen.
   2. Stabilizing `Mask` by updating the fetch process, ensuring all pages are 
fetched if the user selects `Mask`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Parquet] Adaptive Parquet Predicate Pushdown [arrow-rs]

Reply via email to