Re: [I] Epic: Unified TreeNode rewrite API [arrow-datafusion]

via GitHub Sat, 20 Jan 2024 00:34:40 -0800


peter-toth commented on issue #8913:
URL: 
https://github.com/apache/arrow-datafusion/issues/8913#issuecomment-1901951649


   > The main concern I have is composability. Downstream, we have many use 
cases where the flow goes something like this:
   > 
   > 1. Visit/transform a given tree, and save per-node data during the process.
   > 2. Make a decision based depending on the final result of the visitation.
   > 3. Visit/transform the tree again, re-use the saved per-node data and only 
modify when necessary.
   > 
   > (this can repeat multiple times)
   
   This is actually interresting use case and I don't see that requirement in 
the current codebase. Are you adding this optimization in 
https://github.com/apache/arrow-datafusion/pull/8817 to somewhere?
   Also, I'm not sure how often is this required so it might not be a good idea 
to transform all usecases to this direction.
   
   > #8817 works towards having a composable architecture that lends itself to 
such flows, and will eliminate most major sources of code duplication 
concerning `TreeNode`. Similar to how `DynTreeNode` works, it introduces one 
generic object for `ExecutionPlan`-related trees, and one generic object for 
`PhysicalExpr`-related trees.
   
   I don't think that code duplication is the major issue here. I think the 
main issue with the current 11 derived trees (and also with the proposed 
`PlanContext` and `ExprContext`) is that the algorithms they use need to 
maintain both the base tree and the derived tree and keeping them in sync. Also 
the current closures (and https://github.com/apache/arrow-datafusion/pull/8817 
too) are mutating the encoded state in tree nodes so it is very hard to reason 
about the correctness of the algorithms.
   IMO in most of the cases (I checked only 9 out of 11 in the current codebase 
in https://github.com/apache/arrow-datafusion/pull/8664) this is simply not 
neded. In my `transform_with_payload` solution the immutable state of nodes, 
that is propagated down and up during transformation, and the clear transition 
functions encoded into closures give much cleaner and simle algorithms.
   
   What I would propose is proabaly to take a look at 
https://github.com/apache/arrow-datafusion/pull/8664 first (it is a much 
simpler PR than https://github.com/apache/arrow-datafusion/pull/8817 is now) 
and check where the new API can help. If there remain any usecase where 
`transform_with_payload` doesn't prove to be useful there is sill the option to 
do the `PlanContext` / `ExprContext` way of refactor.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Epic: Unified TreeNode rewrite API [arrow-datafusion]

Reply via email to