adriangb commented on code in PR #20065:
URL: https://github.com/apache/datafusion/pull/20065#discussion_r2749769487


##########
datafusion/expr-common/src/placement.rs:
##########
@@ -0,0 +1,59 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Expression placement information for optimization decisions.
+
+/// Describes where an expression should be placed in the query plan for
+/// optimal execution. This is used by optimizers to make decisions about
+/// expression placement, such as whether to push expressions down through
+/// projections.
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
+pub enum ExpressionPlacement {
+    /// A constant literal value.
+    Literal,
+    /// A simple column reference.
+    Column,
+    /// A cheap expression that can be pushed to leaf nodes in the plan.
+    /// Examples include `get_field` for struct field access.
+    PlaceAtLeaves,
+    /// An expensive expression that should stay at the root of the plan.
+    /// This is the default for most expressions.
+    PlaceAtRoot,
+}
+
+impl ExpressionPlacement {
+    /// Returns true if the expression can be pushed down to leaf nodes
+    /// in the query plan.
+    ///
+    /// This returns true for:
+    /// - `Column`: Simple column references can be pushed down. They do no compute and do not increase or
+    ///   decrease the amount of data being processed.
+    ///   A projection that reduces the number of columns can eliminate unnecessary data early,
+    ///   but this method only considers one expression at a time, not a projection as a whole.
+    /// - `PlaceAtLeaves`: Cheap expressions can be pushed down to leaves to take advantage of
+    ///   early computation and potential optimizations at the data source level.
+    ///   For example `struct_col['field']` is cheap to compute (just an Arc clone of the nested array for `'field'`)
+    ///   and thus can reduce data early in the plan at very low compute cost.
+    ///   It may even be possible to eliminate the expression entirely if the data source can project only the needed field
+    ///   (as e.g. Parquet can).
+    pub fn should_push_to_leaves(&self) -> bool {
+        matches!(
+            self,
+            ExpressionPlacement::Column | ExpressionPlacement::PlaceAtLeaves

Review Comment:
   Doing a bit of investigation:
   
   ```
   ❯ cargo run -p datafusion-cli
     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.50s
     Running `target/debug/datafusion-cli`
     DataFusion CLI v52.1.0
     > create table t (a int, b int);
     0 row(s) fetched.
     Elapsed 0.009 seconds.

     > explain format indent select ((select a from t where b > 1) + 1) as b from t where a < 10;
     +---------------+-----------------------------------------------------------------------------+
     | plan_type     | plan                                                                        |
     +---------------+-----------------------------------------------------------------------------+
     | logical_plan  | Projection: __scalar_sq_1.t.a + Int64(1) AS b                               |
     |               |   Left Join:                                                                |
     |               |     Projection:                                                             |
     |               |       Filter: t.a < Int32(10)                                               |
     |               |         TableScan: t projection=[a]                                         |
     |               |     SubqueryAlias: __scalar_sq_1                                            |
     |               |       Projection: CAST(t.a AS Int64)                                        |
     |               |         Filter: t.b > Int32(1)                                              |
     |               |           TableScan: t projection=[a, b]                                    |
   ```
   
   After decorrelation, `__scalar_sq_1.t.a` is just a regular `Expr::Column`
   reference (returning `ExpressionPlacement::Column`). `ExpressionPlacement`
   describes the nature of the expression, i.e. whether it is cheap or expensive
   to compute, not where it can actually go in the plan. A column reference is
   always zero-compute (just a lookup), so it is always classified as `Column`
   and can be pushed down. The optimizer separately validates that the column
   exists (for a join, by checking each side of the join, etc.). This logic
   already exists and this PR should not modify it.
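   
   To make that separation concrete, here is a minimal sketch. It is not
   DataFusion code: the `Expr` type, `placement`, `columns_available`, and
   `can_push_to_side` are hypothetical stand-ins invented for illustration, and
   treating `GetField` as cheap and `Add` as expensive is an assumption of the
   sketch. Only the `ExpressionPlacement` enum and `should_push_to_leaves`
   mirror the diff above.
   
   ```rust
   // Sketch only. `Expr`, `placement`, `columns_available`, and
   // `can_push_to_side` are hypothetical, not DataFusion's real API.

   #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
   pub enum ExpressionPlacement {
       Literal,
       Column,
       PlaceAtLeaves,
       PlaceAtRoot,
   }

   impl ExpressionPlacement {
       /// Same rule as in the diff: only `Column` and `PlaceAtLeaves` qualify.
       pub fn should_push_to_leaves(&self) -> bool {
           matches!(
               self,
               ExpressionPlacement::Column | ExpressionPlacement::PlaceAtLeaves
           )
       }
   }

   /// Hypothetical, heavily simplified expression tree.
   #[derive(Debug, Clone)]
   pub enum Expr {
       Literal(i64),
       Column(String),
       GetField(Box<Expr>, String),
       Add(Box<Expr>, Box<Expr>),
   }

   /// Step 1: classify the nature (cost) of the expression.
   /// This never looks at the plan or at any schema.
   pub fn placement(expr: &Expr) -> ExpressionPlacement {
       match expr {
           Expr::Literal(_) => ExpressionPlacement::Literal,
           Expr::Column(_) => ExpressionPlacement::Column,
           // Assumption for this sketch: field access is always cheap.
           Expr::GetField(..) => ExpressionPlacement::PlaceAtLeaves,
           // Assumption for this sketch: arithmetic stays at the root.
           Expr::Add(..) => ExpressionPlacement::PlaceAtRoot,
       }
   }

   /// Step 2: an independent check that every referenced column exists in
   /// the schema of the candidate input (e.g. one side of a join).
   pub fn columns_available(expr: &Expr, schema: &[&str]) -> bool {
       match expr {
           Expr::Literal(_) => true,
           Expr::Column(name) => schema.contains(&name.as_str()),
           Expr::GetField(inner, _) => columns_available(inner, schema),
           Expr::Add(l, r) => {
               columns_available(l, schema) && columns_available(r, schema)
           }
       }
   }

   /// Pushing down requires both: a cheap classification and valid columns.
   pub fn can_push_to_side(expr: &Expr, schema: &[&str]) -> bool {
       placement(expr).should_push_to_leaves() && columns_available(expr, schema)
   }
   ```
   
   In this sketch, `Expr::Column("__scalar_sq_1.t.a")` is always classified as
   `Column`, yet `can_push_to_side` still rejects an input whose schema does
   not contain that column. The classification and the validity check stay
   independent, which is the point made above.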
   
   I hope this answers the question?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

