peter-toth commented on PR #13177: URL: https://github.com/apache/datafusion/pull/13177#issuecomment-2452940289
Thank you @blaginin for improving the PR and pointing me to those conversations. Unfortunately I missed those and wasn't aware that there have been an attempt to add stack growth libraries. I followed the commits yesterday and saw how each of them made the implementation better and better. I could only suggest minor changes like `TransformingState::ProcessingChildren` probably doesn't need to 2 separate `vec`s, just an index to track which child nodes have been transformed. But, on the other hand, I also looked into `stacker` and `recursive`. `stacker` seems like a very mature crate that has been used in many projects, while `recursive` is just a convenience layer over `stacker` to allow using it even simpler. Honestly, I don't share the concerns about these libraries. `stacker` is owned by the rust-lang team and even the Rust compiler itself uses it: https://github.com/rust-lang/rust/blob/master/compiler/rustc_data_structures/src/stack.rs#L14-L22. `recursive` seems to be created by the Polars guys, and looks like almost only they use it. By running some benchmarks I was referring to running the old vs new APIs on large trees with many nodes. IMO the planning benchmarks tell a lot about everyday queries, but doesn't contain any extreme cases with very large trees. I forked your branch yesterday: https://github.com/peter-toth/datafusion/commits/recursive-vs-iterative/ (not the latest state, but after the visit improvement) and in 3 commits separated the old (recursive) and new (iterative) `visit` API implementations + added `recursive` to the old + added a small benchmark to just quickly test the old vs new `visit` API on tall and wide `Arc<DynTestNode>` trees: ``` % cargo bench --bench tree_node visit old tall tree time: [377.79 µs 380.87 µs 385.39 µs] change: [-3.2937% -2.2943% -1.1798%] (p = 0.00 < 0.05) Performance has improved. Found 13 outliers among 100 measurements (13.00%) 8 (8.00%) low mild 5 (5.00%) high severe visit new tall tree time: [349.62 µs 351.73 µs 353.89 µs] change: [-3.9519% -3.2991% -2.6432%] (p = 0.00 < 0.05) Performance has improved. Found 21 outliers among 100 measurements (21.00%) 5 (5.00%) low severe 2 (2.00%) low mild 8 (8.00%) high mild 6 (6.00%) high severe Benchmarking visit old wide tree: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 86.0s, or reduce sample count to 10. visit old wide tree time: [875.28 ms 879.02 ms 882.86 ms] change: [-1.2651% -0.7020% -0.0989%] (p = 0.02 < 0.05) Change within noise threshold. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe Benchmarking visit new wide tree: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 93.1s, or reduce sample count to 10. visit new wide tree time: [913.97 ms 915.78 ms 917.58 ms] change: [-0.0531% +0.3441% +0.6875%] (p = 0.07 > 0.05) No change in performance detected. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild ``` It looks like the 2 implementations are comparable. Maybe the recursive approach + `stacker` is a bit faster when it comes to large trees (879 vs 915 ms). (And as we discussed most likely `Expr` is where the iterative approach would suffer...) But overall, I feel I'm still leaning tovards `stacker` for a few reasons: - On recursive data structures the recursive algorithms are natural, I find it much easier to follow the old API implementation than the new one. - I have no concerns with `stacker` / `recursive` and they seem to provide a general way to deal with deeply nested recursive calls. Also, these crates can be useful in other areas than `TreeNode`s. - The `apply` / `visit` / `transform` / `rewrite` APIs are the most common ones to process trees, but let's not forget about rules that self organize their tree recursion via `apply_children` / `map_children` in a recursive manner. I do like that we provide these lower level APIs for unique usecases while they can keep the natural recursive implementations. - As we discussed there are trees like `Expr` where the iterative approach would be much clumsier. Sure, we could a deconstruct a node and store its reconsturction closure on the stack and then reconstruct it once we have transformed their children, but I feel the current `map_children` is so much cleaner in this case. - There could be extreme cases with `TreeNode`s as well. I'm thinking of `_with_subqueries()` methods of `LogicalPlan` in which we switch between 2 kinds of trees (`LogicalPlan` / `Expr`) as many times as needed. So stack overflow can happen due to multiple different trees are nested into each other and this can be trivially handled using `stacker`, but you would need to store multiple kinds of `TreeNode`s on the stack, which is again not that straightforward. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
