alamb opened a new issue, #8913:
URL: https://github.com/apache/arrow-datafusion/issues/8913

   ### Is your feature request related to a problem or challenge?
   
   (note this is mostly from @peter-toth 's description on 
https://github.com/apache/arrow-datafusion/pull/8891#issuecomment-1900138152)
   
   As @peter-toth noted, the current `TreeNode` APIs are somewhat inconsistent  
which has the following challenges:
   1. adds a barrier to entry for newcomers to the DataFusion codebase
   3. The APIs sometimes can not (or do no) implement pruning / early tree 
traversal termination and therefore do more tree walking / transforming than 
necessary 
   2. Increases code maintenance costs 
   
   
   ### Specific Challenges of the Current APIs
   Some specific challenges with the current APIs. These can cause a newcommer 
(like me) some confusion:
   - The `apply` / `visit` functions are capable to prune (skip) subtrees or 
return immediately (stop) but the `transform` functions are not. To make it 
more confusion `rewrite` is also capable to to prune, but instead of using a 
common way to control recursion, `visit` and `rewrite` use different enums with 
conflicting semantics.
      See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for 
details on `VisitRecursion` vs `RewriteRecursion`.
   - There is this `Transformed` enum whose purpose is to explicitely define if 
any change was made to a node. This enum is used in `transform` clousres but 
for some reason it is missing from `rewrite`. Moreover this explicit 
information is neither propogatged up to API callee nor it is used for 
detecting changes in nodes.
      See details in https://github.com/apache/arrow-datafusion/pull/8835. I 
believe the current state simply doesn't make sense and just causes confusion.
   - `rewrite` also seems to be inconsistent with `transform_up` and 
`transform_down`. `rewrite` probably organically ended up in its current state 
and capable to do its current job, but what I would expect from such a function 
is that it simply incorporates 2 closures from `transform_up` and 
`transform_down` to define:
     - what transformation should be done before (pre-order) and after 
(post-order) transformation of the node's children
     - and how the recursion should continue .
     
     See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for the 
details. IMO the current `TreeNodeRewriter.pre_visit()` seems like a mess and 
its logic is hard to grasp.
   
   
   
   ### Specific missing APIs
   - Pruning capability of `transform` functions:
      Pruning would be very important as many of the transformations wouldn't 
require traversing on the whole tree and could improve performance a lot.
   - Payload propagation:
      I think the the reason why there are so many (4 base + 9 derived) tree 
implementations (with many code duplications) in DataFusion is that there is no 
API to describe a transformation while also propagating state. In 
https://github.com/apache/arrow-datafusion/pull/8664 with the help of 
`transform_with_payload` I not just eleminated 7 derived trees but also 
improved some of the algorightms to require only one traversal (also a perf 
improvement).
   
   
   ### Describe the solution you'd like
   
   The proposal is a general API like this:
   
   ```rust
       /// Transforms the tree using `f_down` and `f_up` closures. `f_down` is 
applied on a
       /// node while traversing the tree top-down (pre-order, before the 
node's children are
       /// visited) while `f_up` is applied on a node while traversing the tree 
bottom-up
       /// (post-order, after the the node's children are visited).
       ///
       /// The `f_down` closure takes
       /// - a `PD` type payload from its parent
       /// and returns a tuple made of:
       /// - a possibly modified node,
       /// - a `PC` type payload to pass to `f_up`,
       /// - a `Vec<FD>` type payload to propagate down to its children
       ///    (one `FD` element is propagated down to each child),
       /// - a `TreeNodeRecursion` enum element to control recursion flow.
       ///
       /// The `f_up` closure takes
       /// - a `PC` type payload from `f_down` and
       /// - a `Vec<PU>` type payload collected from its children
       /// and returns a tuple made of:
       /// - a possibly modified node,
       /// - a `FU` type payload to propagate up to its parent,
       /// - a `TreeNodeRecursion` enum element to control recursion flow.
       fn transform_with_payload<FD, PD, PC, FU, PU>(
           self,
           f_down: &mut FD,
           payload_down: PD,
           f_up: &mut FU,
       ) -> Result<(Self, PU)>
       where
           FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, 
TreeNodeRecursion)>,
           FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, 
TreeNodeRecursion)>,
   ```
   
   
   
   Obviously the closure return types can be extracted to a type alias or 
`struc`s like in the case of `f_down` could be:
   ```rust
   pub struct TransformDownResult<N, PD, PC> {
       pub transformed_node: N,
       pub payload_to_children: Vec<PD>,
       pub payload_to_f_up: PC,
       pub recursion_control: TreeNodeRecursion,
   }
   ```
   
   All the other functions can then be turned into a specialization of the 
above:
   - `apply` (`visit_down`): Only `f_down` closure that doesn't return a 
modified node and payload (only retrurns `TreeNodeRecursion`)
   - `visit` / `TreeNodeVisitor`: Both `f_down` and `f_up` closures needed. The 
closures can be incorporated into a `TreeNodeVisitor` object but they don't 
return a modified node and payload
   - `transform_down`: Only `f_down` closure that doesn't return payload
   - `transform_up`: Only `f_up` closure that doesn't return payload
   - `transform`: Both `f_down` and `f_up` closures needed, but they don't 
return payloads
   - `rewrite` / `TreeNodeRewriter`: Both `f_down` and `f_up` are incorporated 
into a `TreeNodeRewriter` object, but they don't return a payload
   - `transform_down_with_payload`: Only `f_down` closure
   - `transform_up_with_payload`: Only `f_up` closure
   
   
   ### Describe alternatives you've considered
   
   As noted above, there are a variety of PRs that test out various ideas in 
one way or another and the major challenge I think is figuring out what changes 
to make, in what order, and how much code churn to apply)
   
   
   
   ### Additional context
   
   
   Now, as you mentioned some of these proposed changes have conflicting 
semantics to current APIs. Honestly I can't decide how much turbulance they 
would cause and so I'm seeking feedback from you and the community. But please 
note that I don't insist on any of the above functionaly or namings. These are 
just ideas and I can imagine implementations where:
   - we don't touch any existing API (maybe deprecate old ones)
   - or modify existing APIs to make them consistent, but also keep old API 
alternatives to ease the transition.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to