alamb commented on issue #9577: URL: https://github.com/apache/arrow-datafusion/issues/9577#issuecomment-1997488393
> I think, replacing Box<Expr> usages with Arc<Expr> under the enum Expr would improve performance. I basically agree with this assessment but I have an alternate proposal for how to improve performance 1. I agree that the current `clone()` of Exprs (and `Box::clone` does a deep copy) 2. I believe that the `clone`ing of Exprs takes a significant amount of planning time in DataFusion (I was looking at `cargo bench --bench sql_planner` the other day and substantial amounts of time are spent cloning 3. I think the "ideal" rust pattern for rewriting Exprs is what we have in our ExprRewriters -- take an owned `Expr` and then destructure / rewrite it as needed 4. There are many places in our code where `Expr::clone()` is called Unecessirly 5. If we changed Expr to use `Arc` instead of `Box` I think we would get a planning speedup, but it would be less performant than the ideal pattern (as rewritten nodes would likely rewrite copying, much like LogicalPlan), and it would be a large API change for downstream consumers Here is an example of what I think is a good pattern (there are no copies except when needed) https://github.com/apache/arrow-datafusion/blob/9d0c05b9c2e5f9e774ab1ea08599ac8d4b23c93f/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L638-L648 Here is an example of where Expr cloning is being used unnecessarily https://github.com/apache/arrow-datafusion/blob/9d0c05b9c2e5f9e774ab1ea08599ac8d4b23c93f/datafusion/optimizer/src/simplify_expressions/inlist_simplifier.rs#L45-L49 Thus my suggestion is to go through the planner and remove the calls to `Expr::clone()` as much as possible (likely letting `cargo bench --bench sql_planner` be our guide. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
