alamb opened a new issue, #18845:
URL: https://github.com/apache/datafusion/issues/18845

   ### Is your feature request related to a problem or challenge?
   
   Many parts of the planning process require knowing the type,  nullability 
and recently the field information for each `Expr`
   
   * 
[`Expr::to_field`](https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html#method.to_field)
   * 
[`Expr::get_type`](https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html#method.get_type)
   * 
[`Expr::nullable`](https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html#method.nullable)
   
   The implementation of these methods recomputes the type information on 
demand, for example 
https://github.com/apache/datafusion/blob/4f95e6d70249d0d839939facc67f95cd7f54526e/datafusion/expr/src/expr_schema.rs#L109
   
   This is slow for several reasons:
   1. The computation is re-computed many 
[times](https://github.com/search?q=repo%3Aapache%2Fdatafusion%20get_type&type=code)
 [during](repo:apache/datafusion nullable) planning
   2. The computations themselves are sometimes non trivial (e.g. what 
@pepijnve  added in https://github.com/apache/datafusion/pull/17813 for case 
expressions)
   3. The calculations have to walk the entire expression tree, each time even 
when they don't change
   
   
   I have some small evidence that this type computation takes a non trivial 
amount of planning time, as for example, when I removed some of the cloning in 
`to_field` in this PR, I saw a significant improvement in planning time (2-3%)
   - https://github.com/apache/datafusion/pull/18415
   
   ### Describe the solution you'd like
   
   I would like to find some way to avoid recomputing the results of these 
calculations for Expressions
   
   
   
   ### Describe alternatives you've considered
   
   DataFusion already has a similar pattern for `ExecutionPlan` where the 
properties are computed once and then each ExecutionPlan must report the 
properties. 
   
   * 
https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.properties
   * 
https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.PlanProperties.html
   
   We also need to be careful about rolling this change incrementally rather 
than as one giant PR out as it will be a potentially disruptive API change (as 
Exprs are used all over the place)
   
   So one way to cache this would be to add a method like 
   
   ```rust
   /// Cached result of computing type for 
   struct ExprProperties {
     field: FieldRef,
     table_reference: Option<TableReference>
   }
   ```
   
   trait ExprSchemable {
     /// return cached properties for the Expr, if available
     fn try_properties(&self) -> Option<&ExprProperties>;
   }
   
   impl Expr {
     fn get_type(&self) -> DataType {
       // first check if we have pre-computed results
       if let Some(properties) = self.try_properties() {
         return properties.field.data_type()
       }
       // otherwise compute as normal
     }
   
     // add similar checks for nullable and to_field
   }
   
   Then we could  add an `ExprProperties` field to various `Expr`s over 
multiple PRs and roll out the cached values
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to