Re: [PR] Introduce `TreeNodeMutator` for rewriting `TreeNode`s in place, change optimizer to rewrite `LogicalPlan` in place - 10% faster planning time [arrow-datafusion]

via GitHub Tue, 02 Apr 2024 04:00:53 -0700


alamb commented on PR #9780:
URL: 
https://github.com/apache/arrow-datafusion/pull/9780#issuecomment-2031702263


   Thank you for the feedback @peter-toth 
   
   > I wonder if it would make sense to split this PR into 2 parts as it seem 
to mix 2 very different ideas.
   
   I can definitely split it into two PRs. My rationale for one PR was that a 
PR for `TreeNode::mutate` API would be too abstract. 
   
   > * It introduces an in place mutation `TreeNode` API. First of all, I 
really like that API and I too think it can be faster than the current consume 
and procude style ones (`TreeNode::rewrite()`, 
`TreeNode::transform_down_up()`). But I think the only reason why it is faster 
is because only smaller references (e.g. `&mut LogicalPlan`s) are stored on the 
stack during a transformation and not the original structures themselves 
(`LogicalPlan`s). 
   
   I agree
   
   > Otherwise the 2 APIs are very similar, there should be nothing that you 
can do with a `&mut LogicalPlan`s but you can't do with an owned `LogicalPlan`.
   
   The only difference I see in the APIs is that using `&mut LogicalPlan` 
leaves a partially rewritten node on error, whereas an owned `LogicalPlan` 
consumes the node on error. 
   
   
   > * It uses a trick in `rewrite_arc` to avoid copying `Arc`s. But this trick 
shouldn't require introducing the new API, it should simply work with changing 
the existing `LogicalPlan::map_children()`.
   
   Changing `map_children` to rewrite the Arc'd inputs is a very interesting 
idea -- I think we could do that in a separate PR. I'll give it a try
   
   > So I don't think we need a new API for #9768. Don't get me wrong, I favour 
the new API but I would rather move that to a completely separate PR and slowly 
transform the code to using that in the future.
   > 
   > As for the new `LogicalPlan` methods (`rewrite_inputs()`, 
`rewrite_subqueries()`, `rewrite_exprs()`, ...) in this PR, the problem is that 
they use the new `Transformed::and_then()` method to iterate over sibling nodes 
in the tree (e.g. 
https://github.com/apache/arrow-datafusion/pull/9780/files#diff-b2431d53b487c80c797aecd88859229fb902f74bf888a598878386055b717bbcR188-R189
 or 
https://github.com/apache/arrow-datafusion/pull/9780/files#diff-b2431d53b487c80c797aecd88859229fb902f74bf888a598878386055b717bbcR229-R230).
 But that's not correct as the new `and_then()` doesn't respect 
`TreeNodeRecursion::Stop` behaviour. We already have 
`TransformedIterator::map_until_stop_and_collect()` and 
`TreeNode::try_transform_node()` to chain transformations in such cases (see 
examples in `Expr::map_children()`). BTW, I'm trying to simplyfy those 
`try_transform_node` chained calls in #9876.
   
   This is also an excellent call -- I actually tried initially to use 
`TransformedIterator::map_until_stop_and_collect()` but I couldn't make it work 
with the generics (it needed a function that took `(Self)` as I recall.
   
   
   So how about this for a plan:
   1. I'll make a PR that changes LogicalPlan::`map_children` to avoid copies 
if possible (which should be beneficial in many places in DataFusion)
   2. I'll create a PR that changes the Optimizer to use a TreeNodeRewriter 
(rather than TreeNodeMutator) which will avoid even more copies after 1 is 
complete. 
   
   We can then compare the performance of the TreeNodeRewriter to this PR with 
TreeNodeMutator and decide and see if the additional TreeNodeMutator API is 
worth the extra API surface / migration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Introduce `TreeNodeMutator` for rewriting `TreeNode`s in place, change optimizer to rewrite `LogicalPlan` in place - 10% faster planning time [arrow-datafusion]

Reply via email to