peter-toth commented on PR #8891:
URL: https://github.com/apache/arrow-datafusion/pull/8891#issuecomment-1900138152
Thanks for the question @alamb! I'm new to DataFusion, so this discussion could help me a lot in understanding which approaches are viable for improving the existing code. I come from a different open-source query engine, but I think the `TreeNode` APIs are fundamental to every engine. While getting familiar with DataFusion I stumbled upon quite a few inconsistencies and missing features in these APIs, and I have been trying to put forward some ideas in my PRs that could improve the situation. Some of these PRs simply improve consistency; others are performance related.
Here are a few things I found odd in the current API that I think can cause a newcomer (like me) quite a bit of confusion:
- The `apply` / `visit` functions can prune (skip) subtrees or return immediately (stop), but the `transform` functions cannot. To make it even more confusing, `rewrite` can also prune, but instead of using a common way to control recursion, `visit` and `rewrite` use different enums with conflicting semantics. See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for details on `VisitRecursion` vs `RewriteRecursion`; a unified control enum is sketched after this list.
- There is the `Transformed` enum whose purpose is to state explicitly whether any change was made to the tree. This enum is used in the transform functions but is missing from rewrite. Moreover, this explicit information is neither propagated up to the caller nor used for detecting changes (a wrapper that does propagate it is sketched after this list). See the details in https://github.com/apache/arrow-datafusion/pull/8835. I believe the current state simply doesn't make sense and just causes confusion.
- `rewrite` also seems to be inconsistent with `transform_up` and `transform_down`. `rewrite` probably ended up in its current state organically and is capable of doing its current job, but what I would expect from such a function is that it simply incorporates the 2 closures of `transform_down` and `transform_up` to define:
  - what transformation should be done before (pre-order) and after (post-order) the node's children are transformed,
  - and how the recursion should continue.

  See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for the details. IMO the current `TreeNodeRewriter.pre_visit()` is a mess and hard to grasp.
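To make the first two points concrete, here is a minimal sketch of the direction I have in mind: one recursion-control enum shared by every traversal API, plus a small wrapper that carries the "was anything changed?" information up to the caller. The names and shapes below are only illustrative, not the current DataFusion definitions:
```rust
/// A single enum to control recursion flow, usable by `visit`, `transform` and
/// `rewrite` alike, instead of the current `VisitRecursion` / `RewriteRecursion` pair.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TreeNodeRecursion {
    /// Continue the traversal.
    Continue,
    /// Prune: skip the subtree under the current node but keep visiting its siblings.
    Jump,
    /// Stop the whole traversal immediately.
    Stop,
}

/// A wrapper that records whether a node was actually changed, so transformations
/// can propagate the "something was rewritten" flag up to the caller.
#[derive(Debug)]
pub struct Transformed<T> {
    pub data: T,
    pub transformed: bool,
}

impl<T> Transformed<T> {
    pub fn yes(data: T) -> Self {
        Self { data, transformed: true }
    }

    pub fn no(data: T) -> Self {
        Self { data, transformed: false }
    }
}
```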
Features I miss from the current API:
- Pruning capability in the `transform` functions:
  Pruning would be very important as many transformations don't need to traverse the whole tree, so it could improve performance a lot (see the sketch right after this list).
- Payload propagation:
  I think the reason why there are so many (4 base + 9 derived) tree implementations (with a lot of code duplication) in DataFusion is that there is no API to describe a transformation while also propagating state. In https://github.com/apache/arrow-datafusion/pull/8664, with the help of `transform_with_payload`, I not only eliminated 7 derived trees but also improved some of the algorithms to require only one traversal (also a perf improvement).
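As an illustration of the pruning point, here is a self-contained toy sketch (my own simplified `Node` type and a cut-down recursion enum, not DataFusion code) of what a pruning-aware `transform_down` could look like: the closure can return `Jump` to skip a subtree it knows it will not change, so that subtree is never traversed.
```rust
// A toy tree type standing in for a `TreeNode` implementation.
#[derive(Debug)]
struct Node {
    value: i64,
    children: Vec<Node>,
}

// A cut-down version of the recursion-control enum sketched earlier.
#[derive(PartialEq)]
enum TreeNodeRecursion {
    Continue,
    Jump, // prune: do not descend into this node's children
}

impl Node {
    // A pruning-aware `transform_down`: the closure decides per node whether
    // the traversal should descend into the children at all.
    fn transform_down<F>(self, f: &mut F) -> Node
    where
        F: FnMut(Node) -> (Node, TreeNodeRecursion),
    {
        let (node, recursion) = f(self);
        if recursion == TreeNodeRecursion::Jump {
            // Subtree pruned: the children are never visited.
            return node;
        }
        let Node { value, children } = node;
        let children: Vec<Node> = children
            .into_iter()
            .map(|c| c.transform_down(&mut *f))
            .collect();
        Node { value, children }
    }
}

fn main() {
    let tree = Node {
        value: 1,
        children: vec![Node { value: 2, children: vec![] }],
    };
    // Negate positive values and prune below them: the child with value 2 is
    // never visited because the root already returned `Jump`.
    let tree = tree.transform_down(&mut |mut n: Node| {
        if n.value > 0 {
            n.value = -n.value;
            (n, TreeNodeRecursion::Jump)
        } else {
            (n, TreeNodeRecursion::Continue)
        }
    });
    println!("{tree:?}");
}
```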
So to recap, my ideal `TreeNode` transformation API would look like this:
```rust
/// Transforms the tree using the `f_down` and `f_up` closures. `f_down` is applied on a
/// node while traversing the tree top-down (pre-order, before the node's children are
/// visited), while `f_up` is applied on a node while traversing the tree bottom-up
/// (post-order, after the node's children are visited).
///
/// The `f_down` closure takes
/// - a `PD` type payload from its parent
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `Vec<PD>` type payload to propagate down to its children
///   (one `PD` element is propagated down to each child),
/// - a `PC` type payload to pass to `f_up`,
/// - a `TreeNodeRecursion` enum element to control the recursion flow.
///
/// The `f_up` closure takes
/// - a `PC` type payload from `f_down` and
/// - a `Vec<PU>` type payload collected from its children
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `PU` type payload to propagate up to its parent,
/// - a `TreeNodeRecursion` enum element to control the recursion flow.
fn transform_with_payload<FD, PD, PC, FU, PU>(
    self,
    f_down: &mut FD,
    payload_down: PD,
    f_up: &mut FU,
) -> Result<(Self, PU)>
where
    FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>,
    FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>;
```
Obviously, the closure return types can be extracted into type aliases or `struct`s; for example, in the case of `f_down` it could be:
```rust
pub struct TransformDownResult<N, PD, PC> {
pub transformed_node: N,
pub payload_to_children: Vec<PD>,
pub payload_to_f_up: PC,
pub recursion_control: TreeNodeRecursion,
}
```
All the other functions would then be just specializations of the above:
- `apply` (`visit_down`): Only the `f_down` closure, which doesn't return a modified node or payloads (it only returns `TreeNodeRecursion`)
- `visit` / `TreeNodeVisitor`: Both the `f_down` and `f_up` closures are needed. The closures can be incorporated into a `TreeNodeVisitor` object, but they don't return a modified node or payloads
- `transform_down`: Only the `f_down` closure, without payloads
- `transform_up`: Only the `f_up` closure, without payloads
- `transform`: Both the `f_down` and `f_up` closures are needed, but they don't return payloads
- `rewrite` / `TreeNodeRewriter`: Both `f_down` and `f_up` are incorporated into a `TreeNodeRewriter` object, but they don't return payloads
- `transform_down_with_payload`: Only the `f_down` closure
- `transform_up_with_payload`: Only the `f_up` closure
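For instance, continuing the toy sketch above, a payload-free `transform_down` would just be a thin wrapper over `transform_with_payload` that passes unit payloads everywhere and leaves the node untouched in `f_up` (again only a sketch of the idea, not a proposed signature):
```rust
impl Node {
    // `transform_down` as a specialization of `transform_with_payload`:
    // every payload is `()` and `f_up` is the identity.
    fn transform_down<F>(self, f: &mut F) -> Self
    where
        F: FnMut(Self) -> Self,
    {
        self.transform_with_payload(
            &mut |n: Node, _: ()| {
                let n = f(n);
                let child_count = n.children.len();
                // one unit payload per child, unit payload for f_up
                (n, vec![(); child_count], ())
            },
            (),
            &mut |n: Node, _: (), _: Vec<()>| (n, ()),
        )
        .0
    }
}
```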
I believe all these PRs would take us towards a simple but still versatile API that:
- has performance benefits (e.g. pruning during transformation + improved algorithms in https://github.com/apache/arrow-datafusion/pull/8664),
- reduces code maintenance costs (e.g. fewer derived trees in https://github.com/apache/arrow-datafusion/pull/8664),
- is consistent and easy to grasp (e.g. this PR + https://github.com/apache/arrow-datafusion/pull/8835).
Now, as you mentioned, some of these proposed changes have semantics that conflict with the current APIs. Honestly, I can't judge how much turbulence they would cause, so I'm seeking feedback from you and the community. But please note that I don't insist on any of the functionality or naming above. These are just ideas, and I can imagine implementations where:
- we don't touch any existing API (and maybe deprecate the old ones),
- or we modify the existing APIs to make them consistent, but also keep the old API alternatives to ease the transition.