Re: [PR] Switch to iterative `DynNode` and `ConcreteTreeNode` processing [datafusion]

via GitHub Sat, 02 Nov 2024 03:01:14 -0700


peter-toth commented on PR #13177:
URL: https://github.com/apache/datafusion/pull/13177#issuecomment-2452940289

Thank you @blaginin for improving the PR and pointing me to those
conversations. Unfortunately I missed those and wasn't aware that there have
been an attempt to add stack growth libraries.

I followed the commits yesterday and saw how each of them made the
implementation better and better. I could only suggest minor changes like
`TransformingState::ProcessingChildren` probably doesn't need to 2 separate
`vec`s, just an index to track which child nodes have been transformed.

But, on the other hand, I also looked into `stacker` and `recursive`.
`stacker` seems like a very mature crate that has been used in many projects,
while `recursive` is just a convenience layer over `stacker` to allow using it
even simpler.

Honestly, I don't share the concerns about these libraries. `stacker` is
owned by the rust-lang team and even the Rust compiler itself uses it:
https://github.com/rust-lang/rust/blob/master/compiler/rustc_data_structures/src/stack.rs#L14-L22.
`recursive` seems to be created by the Polars guys, and looks like almost only
they use it.

By running some benchmarks I was referring to running the old vs new APIs on
large trees with many nodes. IMO the planning benchmarks tell a lot about
everyday queries, but doesn't contain any extreme cases with very large trees.

I forked your branch yesterday:
https://github.com/peter-toth/datafusion/commits/recursive-vs-iterative/ (not
the latest state, but after the visit improvement) and in 3 commits separated
the old (recursive) and new (iterative) `visit` API implementations + added
`recursive` to the old + added a small benchmark to just quickly test the old
vs new `visit` API on tall and wide `Arc<DynTestNode>` trees:

```
% cargo bench --bench tree_node

visit old tall tree time: [377.79 µs 380.87 µs 385.39 µs]
change: [-3.2937% -2.2943% -1.1798%] (p = 0.00 <
0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
8 (8.00%) low mild
5 (5.00%) high severe

visit new tall tree time: [349.62 µs 351.73 µs 353.89 µs]
change: [-3.9519% -3.2991% -2.6432%] (p = 0.00 <
0.05)
Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
5 (5.00%) low severe
2 (2.00%) low mild
8 (8.00%) high mild
6 (6.00%) high severe

Benchmarking visit old wide tree: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 86.0s, or reduce sample count to 10.
visit old wide tree time: [875.28 ms 879.02 ms 882.86 ms]
change: [-1.2651% -0.7020% -0.0989%] (p = 0.02 <
0.05)
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe

Benchmarking visit new wide tree: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 93.1s, or reduce sample count to 10.
visit new wide tree time: [913.97 ms 915.78 ms 917.58 ms]
change: [-0.0531% +0.3441% +0.6875%] (p = 0.07 >
0.05)
No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
```

It looks like the 2 implementations are comparable. Maybe the recursive
approach + `stacker` is a bit faster when it comes to large trees (879 vs 915
ms). (And as we discussed most likely `Expr` is where the iterative approach
would suffer...)

But overall, I feel I'm still leaning tovards `stacker` for a few reasons:
- On recursive data structures the recursive algorithms are natural, I find
it much easier to follow the old API implementation than the new one.
- I have no concerns with `stacker` / `recursive` and they seem to provide a
general way to deal with deeply nested recursive calls. Also, these crates can
be useful in other areas than `TreeNode`s.
- The `apply` / `visit` / `transform` / `rewrite` APIs are the most common
ones to process trees, but let's not forget about rules that self organize
their tree recursion via `apply_children` / `map_children` in a recursive
manner. I do like that we provide these lower level APIs for unique usecases
while they can keep the natural recursive implementations.
- As we discussed there are trees like `Expr` where the iterative approach
would be much clumsier. Sure, we could a deconstruct a node and store its
reconsturction closure on the stack and then reconstruct it once we have
transformed their children, but I feel the current `map_children` is so much
cleaner in this case.
- There could be extreme cases with `TreeNode`s as well. I'm thinking of
`_with_subqueries()` methods of `LogicalPlan` in which we switch between 2
kinds of trees (`LogicalPlan` / `Expr`) as many times as needed. So stack
overflow can happen due to multiple different trees are nested into each other
and this can be trivially handled using `stacker`, but you would need to store
multiple kinds of `TreeNode`s on the stack, which is again not that
straightforward.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Switch to iterative `DynNode` and `ConcreteTreeNode` processing [datafusion]

Reply via email to