adriangb commented on code in PR #20117:
URL: https://github.com/apache/datafusion/pull/20117#discussion_r2756607559


##########
datafusion/optimizer/src/extract_leaf_expressions.rs:
##########
@@ -0,0 +1,1870 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [`ExtractLeafExpressions`] extracts `MoveTowardsLeafNodes` sub-expressions into projections.
+//!
+//! This optimizer rule normalizes the plan so that all `MoveTowardsLeafNodes` computations
+//! (like field accessors) live in Projection nodes immediately above scan nodes, making them
+//! eligible for pushdown by the `OptimizeProjections` rule.
+//!
+//! ## Algorithm
+//!
+//! This rule uses **BottomUp** traversal to push ALL `MoveTowardsLeafNodes` expressions
+//! (like `get_field`) to projections immediately above scan nodes. This enables optimal
+//! Parquet column pruning.
+//!
+//! ### Node Classification
+//!
+//! **Barrier Nodes** (stop pushing, create projection above):
+//! - `TableScan` - the leaf, ideal extraction point
+//! - `Join` - requires routing to left/right sides
+//! - `Aggregate` - changes schema semantics
+//! - `SubqueryAlias` - scope boundary
+//! - `Union`, `Intersect`, `Except` - schema boundaries
+//!
+//! **Schema-Preserving Nodes** (push through):
+//! - `Filter` - passes all input columns through
+//! - `Sort` - passes all input columns through
+//! - `Limit` - passes all input columns through
+//! - Passthrough `Projection` - only column references
+//!
+//! ### How It Works
+//!
+//! 1. Process leaf nodes first (TableScan, etc.)
+//! 2. When processing higher nodes, descendants are already finalized
+//! 3. Push extractions DOWN through the plan, merging into existing extracted
+//!    expression projections when possible
+
+use indexmap::{IndexMap, IndexSet};
+use std::sync::Arc;
+
+use datafusion_common::alias::AliasGenerator;
+use datafusion_common::tree_node::{Transformed, TreeNode, TreeNodeRecursion};
+use datafusion_common::{Column, DFSchema, Result};
+use datafusion_expr::expr_rewriter::NamePreserver;
+use datafusion_expr::logical_plan::LogicalPlan;
+use datafusion_expr::{Expr, ExpressionPlacement, Filter, Limit, Projection, Sort};
+
+use crate::optimizer::ApplyOrder;
+use crate::utils::{EXTRACTED_EXPR_PREFIX, has_all_column_refs, is_extracted_expr_projection};
+use crate::{OptimizerConfig, OptimizerRule};
+
+/// Extracts `MoveTowardsLeafNodes` sub-expressions from all nodes into projections.
+///
+/// This normalizes the plan so that all `MoveTowardsLeafNodes` computations (like field
+/// accessors) live in Projection nodes, making them eligible for pushdown.
+///
+/// # Example
+///
+/// Given a filter with a struct field access:
+///
+/// ```text
+/// Filter: user['status'] = 'active'
+///   TableScan: t [user]
+/// ```
+///
+/// This rule extracts the field access into a projection:
+///
+/// ```text
+/// Filter: __datafusion_extracted_1 = 'active'
+///   Projection: user['status'] AS __datafusion_extracted_1, user
+///     TableScan: t [user]
+/// ```
+///
+/// The `OptimizeProjections` rule can then push this projection down to the scan.
+///
+/// **Important:** The `PushDownFilter` rule is aware of projections created by this rule
+/// and will not push filters through them. See `is_extracted_expr_projection` in utils.rs.
+#[derive(Default, Debug)]
+pub struct ExtractLeafExpressions {}
+
+impl ExtractLeafExpressions {
+    /// Create a new [`ExtractLeafExpressions`]
+    pub fn new() -> Self {
+        Self {}
+    }
+}
+
+impl OptimizerRule for ExtractLeafExpressions {
+    fn name(&self) -> &str {
+        "extract_leaf_expressions"
+    }
+
+    fn apply_order(&self) -> Option<ApplyOrder> {
+        Some(ApplyOrder::BottomUp)
+    }
+
+    fn rewrite(
+        &self,
+        plan: LogicalPlan,
+        config: &dyn OptimizerConfig,
+    ) -> Result<Transformed<LogicalPlan>> {
+        let alias_generator = config.alias_generator();
+        extract_from_plan(plan, alias_generator)
+    }
+}
+
+/// Extracts `MoveTowardsLeafNodes` sub-expressions from a plan node.
+///
+/// With BottomUp traversal, we process leaves first, then work up.
+/// This allows us to push extractions all the way down to scan nodes.
+fn extract_from_plan(
+    plan: LogicalPlan,
+    alias_generator: &Arc<AliasGenerator>,
+) -> Result<Transformed<LogicalPlan>> {
+    match &plan {
+        // Schema-preserving nodes - extract and push down
+        LogicalPlan::Filter(_) | LogicalPlan::Sort(_) | LogicalPlan::Limit(_) => {
+            extract_from_schema_preserving(plan, alias_generator)
+        }
+
+        // Schema-transforming nodes need special handling
+        LogicalPlan::Aggregate(_) => extract_from_aggregate(plan, alias_generator),
+        LogicalPlan::Projection(_) => extract_from_projection(plan, alias_generator),
+        LogicalPlan::Join(_) => extract_from_join(plan, alias_generator),
+
+        // Everything else passes through unchanged
+        _ => Ok(Transformed::no(plan)),

Review Comment:
   I'm not sure what else we could handle here. Maybe Extension?
   
   Before we merge this PR, we should expand this to explicitly list all other
   nodes, so that anyone who adds a new node has to decide how this rule should
   handle it. I'll wait to do that, since it adds another +30 LOC to the diff.
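
   A minimal sketch of the idea, not the actual change: it uses a toy `Plan`
   enum in place of `LogicalPlan`, and the `Handling` enum and `classify`
   function are hypothetical names made up for illustration. With no `_` arm,
   the compiler rejects the match as soon as someone adds a variant without
   listing it here.

```rust
/// Toy stand-in for DataFusion's `LogicalPlan` (assumption: only a few
/// illustrative variants, with no payloads).
#[allow(dead_code)]
#[derive(Debug)]
enum Plan {
    Filter,
    Sort,
    Limit,
    Aggregate,
    Projection,
    Join,
    TableScan,
    Union,
    Extension,
}

/// Hypothetical summary of what the rule does with a node.
#[derive(Debug, PartialEq)]
enum Handling {
    ExtractAndPushDown,
    SpecialCase,
    Unchanged,
}

fn classify(plan: &Plan) -> Handling {
    match plan {
        // Schema-preserving nodes: extract and push down.
        Plan::Filter | Plan::Sort | Plan::Limit => Handling::ExtractAndPushDown,
        // Schema-transforming nodes: dedicated handling.
        Plan::Aggregate | Plan::Projection | Plan::Join => Handling::SpecialCase,
        // Explicitly ignored nodes. There is deliberately no `_` arm, so a
        // newly added variant is a compile error until it is listed somewhere.
        Plan::TableScan | Plan::Union | Plan::Extension => Handling::Unchanged,
    }
}

fn main() {
    assert_eq!(classify(&Plan::Filter), Handling::ExtractAndPushDown);
    assert_eq!(classify(&Plan::Join), Handling::SpecialCase);
    assert_eq!(classify(&Plan::Extension), Handling::Unchanged);
}
```

   The cost is a longer match; the benefit is that the decision for each new
   node type becomes explicit instead of silently defaulting to "unchanged".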



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

