peter-toth commented on PR #8891:
URL: https://github.com/apache/arrow-datafusion/pull/8891#issuecomment-1900138152
Thanks for the question @alamb! I'm new to DataFusion, so this discussion could help me a lot in understanding which approaches are viable for improving the existing code. I come from a different open-source query engine, but I think the `TreeNode` APIs are fundamental to every engine. While getting familiar with DataFusion I stumbled upon quite a few inconsistencies and missing features in these APIs, and I have been trying to put forward some ideas in my PRs that could improve the situation. Some of these PRs simply improve consistency; others are performance related.
Here are a few things I found odd in the current API that I think can cause a newcomer (like me) quite a bit of confusion:
- The `apply` / `visit` functions can prune (skip) subtrees or return immediately (stop), but the `transform` functions cannot. To make it even more confusing, `rewrite` can also prune, but instead of using a common way to control recursion, `visit` and `rewrite` use different enums with conflicting semantics. See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for details on `VisitRecursion` vs `RewriteRecursion`; a unified control enum is sketched after this list.
- There is the `Transformed` enum whose purpose is to state explicitly whether any change was made to the tree. This enum is used in the transform functions but is missing from rewrite. Moreover, this explicit information is neither propagated up to the caller nor used for detecting changes (a wrapper that does propagate it is sketched after this list). See the details in https://github.com/apache/arrow-datafusion/pull/8835. I believe the current state simply doesn't make sense and just causes confusion.
- `rewrite` also seems to be inconsistent with `transform_up` and `transform_down`. `rewrite` probably ended up in its current state organically and is capable of doing its current job, but what I would expect from such a function is that it simply incorporates the 2 closures of `transform_down` and `transform_up` to define:
  - what transformation should be done before (pre-order) and after (post-order) the node's children are transformed,
  - and how the recursion should continue.

  See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for the details. IMO the current `TreeNodeRewriter.pre_visit()` is a mess and hard to grasp.
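To make the first two points concrete, here is a minimal sketch of the direction I have in mind: one recursion-control enum shared by every traversal API, plus a small wrapper that carries the "was anything changed?" information up to the caller. The names and shapes below are only illustrative, not the current DataFusion definitions:
```rust
/// A single enum to control recursion flow, usable by `visit`, `transform` and
/// `rewrite` alike, instead of the current `VisitRecursion` / `RewriteRecursion` pair.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TreeNodeRecursion {
    /// Continue the traversal.
    Continue,
    /// Prune: skip the subtree under the current node but keep visiting its siblings.
    Jump,
    /// Stop the whole traversal immediately.
    Stop,
}

/// A wrapper that records whether a node was actually changed, so transformations
/// can propagate the "something was rewritten" flag up to the caller.
#[derive(Debug)]
pub struct Transformed<T> {
    pub data: T,
    pub transformed: bool,
}

impl<T> Transformed<T> {
    pub fn yes(data: T) -> Self {
        Self { data, transformed: true }
    }

    pub fn no(data: T) -> Self {
        Self { data, transformed: false }
    }
}
```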
Features I miss from the current API:
- Pruning capability in the `transform` functions:
  Pruning would be very important as many transformations don't need to traverse the whole tree, so it could improve performance a lot (see the sketch right after this list).
- Payload propagation:
  I think the reason why there are so many (4 base + 9 derived) tree implementations (with a lot of code duplication) in DataFusion is that there is no API to describe a transformation while also propagating state. In https://github.com/apache/arrow-datafusion/pull/8664, with the help of `transform_with_payload`, I not only eliminated 7 derived trees but also improved some of the algorithms to require only one traversal (also a perf improvement).
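As an illustration of the pruning point, here is a self-contained toy sketch (my own simplified `Node` type and a cut-down recursion enum, not DataFusion code) of what a pruning-aware `transform_down` could look like: the closure can return `Jump` to skip a subtree it knows it will not change, so that subtree is never traversed.
```rust
// A toy tree type standing in for a `TreeNode` implementation.
#[derive(Debug)]
struct Node {
    value: i64,
    children: Vec<Node>,
}

// A cut-down version of the recursion-control enum sketched earlier.
#[derive(PartialEq)]
enum TreeNodeRecursion {
    Continue,
    Jump, // prune: do not descend into this node's children
}

impl Node {
    // A pruning-aware `transform_down`: the closure decides per node whether
    // the traversal should descend into the children at all.
    fn transform_down<F>(self, f: &mut F) -> Node
    where
        F: FnMut(Node) -> (Node, TreeNodeRecursion),
    {
        let (node, recursion) = f(self);
        if recursion == TreeNodeRecursion::Jump {
            // Subtree pruned: the children are never visited.
            return node;
        }
        let Node { value, children } = node;
        let children: Vec<Node> = children
            .into_iter()
            .map(|c| c.transform_down(&mut *f))
            .collect();
        Node { value, children }
    }
}

fn main() {
    let tree = Node {
        value: 1,
        children: vec![Node { value: 2, children: vec![] }],
    };
    // Negate positive values and prune below them: the child with value 2 is
    // never visited because the root already returned `Jump`.
    let tree = tree.transform_down(&mut |mut n: Node| {
        if n.value > 0 {
            n.value = -n.value;
            (n, TreeNodeRecursion::Jump)
        } else {
            (n, TreeNodeRecursion::Continue)
        }
    });
    println!("{tree:?}");
}
```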
So to recap, my ideal `TreeNode` transformation API would look like this:
```rust
/// Transforms the tree using the `f_down` and `f_up` closures. `f_down` is applied on a
/// node while traversing the tree top-down (pre-order, before the node's children are
/// visited), while `f_up` is applied on a node while traversing the tree bottom-up
/// (post-order, after the node's children are visited).
///
/// The `f_down` closure takes
/// - a `PD` type payload from its parent
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `Vec<PD>` type payload to propagate down to its children
///   (one `PD` element is propagated down to each child),
/// - a `PC` type payload to pass to `f_up`,
/// - a `TreeNodeRecursion` enum element to control the recursion flow.
///
/// The `f_up` closure takes
/// - a `PC` type payload from `f_down` and
/// - a `Vec<PU>` type payload collected from its children
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `PU` type payload to propagate up to its parent,
/// - a `TreeNodeRecursion` enum element to control the recursion flow.
fn transform_with_payload<FD, PD, PC, FU, PU>(
    self,
    f_down: &mut FD,
    payload_down: PD,
    f_up: &mut FU,
) -> Result<(Self, PU)>
where
    FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>,
    FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>;
```
Obviously, the closure return types can be extracted into type aliases or `struct`s; for example, in the case of `f_down` it could be:
```rust
pub struct TransformDownResult<N, PD, PC> {
pub transformed_node: N,
pub payload_to_children: Vec<PD>,
pub payload_to_f_up: PC,
pub recursion_control: TreeNodeRecursion,
}
```
All the other functions would then be just specializations of the above:
- `apply` (`visit_down`): Only the `f_down` closure, which doesn't return a modified node or payloads (it only returns `TreeNodeRecursion`)
- `visit` / `TreeNodeVisitor`: Both the `f_down` and `f_up` closures are needed. The closures can be incorporated into a `TreeNodeVisitor` object, but they don't return a modified node or payloads
- `transform_down`: Only the `f_down` closure, without payloads
- `transform_up`: Only the `f_up` closure, without payloads
- `transform`: Both the `f_down` and `f_up` closures are needed, but they don't return payloads
- `rewrite` / `TreeNodeRewriter`: Both `f_down` and `f_up` are incorporated into a `TreeNodeRewriter` object, but they don't return payloads
- `transform_down_with_payload`: Only the `f_down` closure
- `transform_up_with_payload`: Only the `f_up` closure
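For instance, continuing the toy sketch above, a payload-free `transform_down` would just be a thin wrapper over `transform_with_payload` that passes unit payloads everywhere and leaves the node untouched in `f_up` (again only a sketch of the idea, not a proposed signature):
```rust
impl Node {
    // `transform_down` as a specialization of `transform_with_payload`:
    // every payload is `()` and `f_up` is the identity.
    fn transform_down<F>(self, f: &mut F) -> Self
    where
        F: FnMut(Self) -> Self,
    {
        self.transform_with_payload(
            &mut |n: Node, _: ()| {
                let n = f(n);
                let child_count = n.children.len();
                // one unit payload per child, unit payload for f_up
                (n, vec![(); child_count], ())
            },
            (),
            &mut |n: Node, _: (), _: Vec<()>| (n, ()),
        )
        .0
    }
}
```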
I believe all these PRs would take us towards a simple but still versatile API that:
- has performance benefits (e.g. pruning during transformation + improved algorithms in https://github.com/apache/arrow-datafusion/pull/8664),
- reduces code maintenance costs (e.g. fewer derived trees in https://github.com/apache/arrow-datafusion/pull/8664),
- is consistent and easy to grasp (e.g. this PR + https://github.com/apache/arrow-datafusion/pull/8835).
Now, as you mentioned, some of these proposed changes have semantics that conflict with the current APIs. Honestly, I can't judge how much turbulence they would cause, so I'm seeking feedback from you and the community. But please note that I don't insist on any of the functionality or naming above. These are just ideas, and I can imagine implementations where:
- we don't touch any existing API (and maybe deprecate the old ones),
- or we modify the existing APIs to make them consistent, but also keep the old API alternatives to ease the transition.