alamb commented on PR #9780: URL: https://github.com/apache/arrow-datafusion/pull/9780#issuecomment-2031702263
Thank you for the feedback @peter-toth > I wonder if it would make sense to split this PR into 2 parts as it seem to mix 2 very different ideas. I can definitely split it into two PRs. My rationale for one PR was that a PR for `TreeNode::mutate` API would be too abstract. > * It introduces an in place mutation `TreeNode` API. First of all, I really like that API and I too think it can be faster than the current consume and procude style ones (`TreeNode::rewrite()`, `TreeNode::transform_down_up()`). But I think the only reason why it is faster is because only smaller references (e.g. `&mut LogicalPlan`s) are stored on the stack during a transformation and not the original structures themselves (`LogicalPlan`s). I agree > Otherwise the 2 APIs are very similar, there should be nothing that you can do with a `&mut LogicalPlan`s but you can't do with an owned `LogicalPlan`. The only difference I see in the APIs is that using `&mut LogicalPlan` leaves a partially rewritten node on error, whereas an owned `LogicalPlan` consumes the node on error. > * It uses a trick in `rewrite_arc` to avoid copying `Arc`s. But this trick shouldn't require introducing the new API, it should simply work with changing the existing `LogicalPlan::map_children()`. Changing `map_children` to rewrite the Arc'd inputs is a very interesting idea -- I think we could do that in a separate PR. I'll give it a try > So I don't think we need a new API for #9768. Don't get me wrong, I favour the new API but I would rather move that to a completely separate PR and slowly transform the code to using that in the future. > > As for the new `LogicalPlan` methods (`rewrite_inputs()`, `rewrite_subqueries()`, `rewrite_exprs()`, ...) in this PR, the problem is that they use the new `Transformed::and_then()` method to iterate over sibling nodes in the tree (e.g. https://github.com/apache/arrow-datafusion/pull/9780/files#diff-b2431d53b487c80c797aecd88859229fb902f74bf888a598878386055b717bbcR188-R189 or https://github.com/apache/arrow-datafusion/pull/9780/files#diff-b2431d53b487c80c797aecd88859229fb902f74bf888a598878386055b717bbcR229-R230). But that's not correct as the new `and_then()` doesn't respect `TreeNodeRecursion::Stop` behaviour. We already have `TransformedIterator::map_until_stop_and_collect()` and `TreeNode::try_transform_node()` to chain transformations in such cases (see examples in `Expr::map_children()`). BTW, I'm trying to simplyfy those `try_transform_node` chained calls in #9876. This is also an excellent call -- I actually tried initially to use `TransformedIterator::map_until_stop_and_collect()` but I couldn't make it work with the generics (it needed a function that took `(Self)` as I recall. So how about this for a plan: 1. I'll make a PR that changes LogicalPlan::`map_children` to avoid copies if possible (which should be beneficial in many places in DataFusion) 2. I'll create a PR that changes the Optimizer to use a TreeNodeRewriter (rather than TreeNodeMutator) which will avoid even more copies after 1 is complete. We can then compare the performance of the TreeNodeRewriter to this PR with TreeNodeMutator and decide and see if the additional TreeNodeMutator API is worth the extra API surface / migration. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
