asolimando commented on code in PR #22235:
URL: https://github.com/apache/datafusion/pull/22235#discussion_r3280521346


##########
datafusion/datasource-parquet/src/page_filter.rs:
##########
@@ -336,6 +410,190 @@ impl PagePruningAccessPlanFilter {
     pub fn filter_number(&self) -> usize {
         self.predicates.len()
     }
+
+    /// Like [`Self::prune_plan_with_page_index`] but also surfaces, as a
+    /// side-effect of the pruning iteration that already runs, a
+    /// per-conjunct accumulator with the rows that conjunct alone
+    /// would have proven skippable. Callers use this to seed a
+    /// per-FilterId selectivity prior without doing any extra pruning
+    /// work — every page-index lookup that would have happened in
+    /// `prune_plan_with_page_index` happens exactly once here too.
+    pub fn prune_plan_with_per_conjunct_stats(
+        &self,
+        mut access_plan: ParquetAccessPlan,
+        arrow_schema: &Schema,
+        parquet_schema: &SchemaDescriptor,
+        parquet_metadata: &ParquetMetaData,
+        file_metrics: &ParquetFileMetrics,
+    ) -> (ParquetAccessPlan, Vec<PerConjunctPageStats>, usize) {
+        // scoped timer updates on drop
+        let _timer_guard = file_metrics.page_index_eval_time.timer();
+
+        let mut per_conjunct: Vec<PerConjunctPageStats> = 
(0..self.predicates.len())
+            .map(|i| PerConjunctPageStats {
+                tag: self.tags.as_ref().and_then(|t| t.get(i).copied()),
+                rows_seen: 0,
+                rows_skipped: 0,
+            })
+            .collect();
+
+        if self.predicates.is_empty() {
+            return (access_plan, per_conjunct, 0);
+        }
+
+        let groups = parquet_metadata.row_groups();
+        if groups.is_empty() {
+            return (access_plan, per_conjunct, 0);
+        }
+
+        if parquet_metadata.offset_index().is_none()
+            || parquet_metadata.column_index().is_none()
+        {
+            return (access_plan, per_conjunct, 0);
+        }
+
+        // Same accumulators as the untagged path, plus per-conjunct.

Review Comment:
   The comment shows that you are aware of the duplication w.r.t. the untagged 
path, is this a temporary state for the draft? (with draft PRs it's sometimes a 
bit tricky to tell what is what, apologies if that should be evident at this 
point)



##########
datafusion/pruning/src/pruning_predicate.rs:
##########
@@ -568,6 +697,50 @@ impl PruningPredicate {
         Ok(builder.build())
     }
 
+    /// Like [`Self::prune`] but also returns per-conjunct pruning
+    /// stats when this predicate was constructed via
+    /// [`Self::try_new_tagged_conjuncts`]. The `Vec<bool>` is the same
+    /// AND'd result `prune` would return; the per-conjunct stats are
+    /// computed by evaluating each leaf sub-predicate against the same
+    /// `statistics` and counting pruned/seen containers.
+    ///
+    /// Returns `(prune_result, vec![])` for predicates constructed via
+    /// the plain [`Self::try_new`] (no sub-predicates available).
+    pub fn prune_per_conjunct<S: PruningStatistics + ?Sized>(
+        &self,
+        statistics: &S,
+    ) -> Result<(Vec<bool>, Vec<PerConjunctPruneStats>)> {
+        let combined = self.prune(statistics)?;
+        let Some(sub_predicates) = &self.sub_predicates else {
+            return Ok((combined, Vec::new()));
+        };
+
+        let total_containers = statistics.num_containers();
+        let mut per_conjunct: Vec<PerConjunctPruneStats> =
+            Vec::with_capacity(sub_predicates.len());
+        for sub in sub_predicates {
+            let kept = sub.predicate.prune(statistics)?;

Review Comment:
   It seems that we are looping through the predicates twice, the second time 
here, but IIUC we already do a full traversal at line 713, I wonder if we could 
do both at the same time as this might be costly with many conjuncts/deep 
expressions?



##########
datafusion/pruning/src/pruning_predicate.rs:
##########
@@ -499,9 +544,77 @@ impl PruningPredicate {
             required_columns,
             orig_expr: expr,
             literal_guarantees,
+            sub_predicates: None,
         })
     }
 
+    /// Like [`Self::try_new`] but takes already-split top-level
+    /// conjuncts each carrying a caller tag (typically a `FilterId`).
+    /// Builds one [`PruningPredicate`] per conjunct as a leaf
+    /// sub-predicate. The wrapper itself is a marker holding the
+    /// leaves; calls to [`Self::prune`] AND the leaves' results,
+    /// avoiding a redundant combined-predicate construction.
+    /// [`Self::prune_per_conjunct`] also reads from the same leaves.
+    ///
+    /// Conjuncts whose individual `try_new` would error or return
+    /// always-true are silently skipped (their tags do not appear in
+    /// the per-conjunct output).
+    pub fn try_new_tagged_conjuncts(
+        tagged: &[(usize, Arc<dyn PhysicalExpr>)],
+        schema: SchemaRef,
+    ) -> Result<Self> {
+        // Build per-conjunct PruningPredicates (each is a leaf — i.e.
+        // its own `sub_predicates` is `None`).
+        let mut sub_predicates: Vec<TaggedSubPredicate> = Vec::new();
+        for (tag, expr) in tagged {
+            match Self::try_new(Arc::clone(expr), Arc::clone(&schema)) {
+                Ok(p) if !p.always_true() => {
+                    sub_predicates.push(TaggedSubPredicate {
+                        tag: Some(*tag),
+                        predicate: p,
+                    });
+                }
+                Ok(_) => {
+                    // always-true: skip; leaves the tag unrepresented.
+                    continue;
+                }
+                Err(e) => {
+                    debug!("try_new_tagged_conjuncts: skipping conjunct {tag}: 
{e}");
+                    continue;
+                }
+            }
+        }
+
+        // Build a marker wrapper. Its own `predicate_expr` is a
+        // literal `true` placeholder; `prune` is special-cased below

Review Comment:
   There is no caller for this today outside the adaptive filter machinery, but 
I am wondering if people might start using it, expecting meaningful values here 
and misuse the placeholder to make incorrect decisions.
   
   I wanted to suggest to try and delegate to leaves for 
`literal_guarantees()`, `required_columns()` and `predicate_expr()`, but it 
seems it's not doable and we would need to build the "real" result at 
construction time and we can't do it at the end.
   
   This again might be just postponed for the non-draft version, but it seems 
this would need some careful thinking in the design phase



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to