alamb opened a new issue, #8913: URL: https://github.com/apache/arrow-datafusion/issues/8913
### Is your feature request related to a problem or challenge? (note this is mostly from @peter-toth 's description on https://github.com/apache/arrow-datafusion/pull/8891#issuecomment-1900138152) As @peter-toth noted, the current `TreeNode` APIs are somewhat inconsistent which has the following challenges: 1. adds a barrier to entry for newcomers to the DataFusion codebase 3. The APIs sometimes can not (or do no) implement pruning / early tree traversal termination and therefore do more tree walking / transforming than necessary 2. Increases code maintenance costs ### Specific Challenges of the Current APIs Some specific challenges with the current APIs. These can cause a newcommer (like me) some confusion: - The `apply` / `visit` functions are capable to prune (skip) subtrees or return immediately (stop) but the `transform` functions are not. To make it more confusion `rewrite` is also capable to to prune, but instead of using a common way to control recursion, `visit` and `rewrite` use different enums with conflicting semantics. See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for details on `VisitRecursion` vs `RewriteRecursion`. - There is this `Transformed` enum whose purpose is to explicitely define if any change was made to a node. This enum is used in `transform` clousres but for some reason it is missing from `rewrite`. Moreover this explicit information is neither propogatged up to API callee nor it is used for detecting changes in nodes. See details in https://github.com/apache/arrow-datafusion/pull/8835. I believe the current state simply doesn't make sense and just causes confusion. - `rewrite` also seems to be inconsistent with `transform_up` and `transform_down`. `rewrite` probably organically ended up in its current state and capable to do its current job, but what I would expect from such a function is that it simply incorporates 2 closures from `transform_up` and `transform_down` to define: - what transformation should be done before (pre-order) and after (post-order) transformation of the node's children - and how the recursion should continue . See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for the details. IMO the current `TreeNodeRewriter.pre_visit()` seems like a mess and its logic is hard to grasp. ### Specific missing APIs - Pruning capability of `transform` functions: Pruning would be very important as many of the transformations wouldn't require traversing on the whole tree and could improve performance a lot. - Payload propagation: I think the the reason why there are so many (4 base + 9 derived) tree implementations (with many code duplications) in DataFusion is that there is no API to describe a transformation while also propagating state. In https://github.com/apache/arrow-datafusion/pull/8664 with the help of `transform_with_payload` I not just eleminated 7 derived trees but also improved some of the algorightms to require only one traversal (also a perf improvement). ### Describe the solution you'd like The proposal is a general API like this: ```rust /// Transforms the tree using `f_down` and `f_up` closures. `f_down` is applied on a /// node while traversing the tree top-down (pre-order, before the node's children are /// visited) while `f_up` is applied on a node while traversing the tree bottom-up /// (post-order, after the the node's children are visited). /// /// The `f_down` closure takes /// - a `PD` type payload from its parent /// and returns a tuple made of: /// - a possibly modified node, /// - a `PC` type payload to pass to `f_up`, /// - a `Vec<FD>` type payload to propagate down to its children /// (one `FD` element is propagated down to each child), /// - a `TreeNodeRecursion` enum element to control recursion flow. /// /// The `f_up` closure takes /// - a `PC` type payload from `f_down` and /// - a `Vec<PU>` type payload collected from its children /// and returns a tuple made of: /// - a possibly modified node, /// - a `FU` type payload to propagate up to its parent, /// - a `TreeNodeRecursion` enum element to control recursion flow. fn transform_with_payload<FD, PD, PC, FU, PU>( self, f_down: &mut FD, payload_down: PD, f_up: &mut FU, ) -> Result<(Self, PU)> where FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>, FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>, ``` Obviously the closure return types can be extracted to a type alias or `struc`s like in the case of `f_down` could be: ```rust pub struct TransformDownResult<N, PD, PC> { pub transformed_node: N, pub payload_to_children: Vec<PD>, pub payload_to_f_up: PC, pub recursion_control: TreeNodeRecursion, } ``` All the other functions can then be turned into a specialization of the above: - `apply` (`visit_down`): Only `f_down` closure that doesn't return a modified node and payload (only retrurns `TreeNodeRecursion`) - `visit` / `TreeNodeVisitor`: Both `f_down` and `f_up` closures needed. The closures can be incorporated into a `TreeNodeVisitor` object but they don't return a modified node and payload - `transform_down`: Only `f_down` closure that doesn't return payload - `transform_up`: Only `f_up` closure that doesn't return payload - `transform`: Both `f_down` and `f_up` closures needed, but they don't return payloads - `rewrite` / `TreeNodeRewriter`: Both `f_down` and `f_up` are incorporated into a `TreeNodeRewriter` object, but they don't return a payload - `transform_down_with_payload`: Only `f_down` closure - `transform_up_with_payload`: Only `f_up` closure ### Describe alternatives you've considered As noted above, there are a variety of PRs that test out various ideas in one way or another and the major challenge I think is figuring out what changes to make, in what order, and how much code churn to apply) ### Additional context Now, as you mentioned some of these proposed changes have conflicting semantics to current APIs. Honestly I can't decide how much turbulance they would cause and so I'm seeking feedback from you and the community. But please note that I don't insist on any of the above functionaly or namings. These are just ideas and I can imagine implementations where: - we don't touch any existing API (maybe deprecate old ones) - or modify existing APIs to make them consistent, but also keep old API alternatives to ease the transition. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
