findepi commented on PR #13618: URL: https://github.com/apache/datafusion/pull/13618#issuecomment-2511141746
> `TreeNode` traversal APIs are crucial parts of query plan analysis and optimzation. This PR explores the idea of storing some pre-calculatd statistics/properties nodes (or subtrees) inside the nodes during creation and automatically update the stats during transfomations. I love the idea in principle (vide https://github.com/apache/datafusion/issues/12604) There are however some important questions - does the party creating given plan component know all the values we could want to eventually infer from the plan? - statistics is a great example. statistics retrieval for a table scan can be expensive. statistics inference for a filter or join is actually complicated. the algorithm for doing that is not part of the plan, and the call site creating the plan shouldn't need to know what the algorithm is - lazy vs eager. statistics is a great example again. In data lake systems like Iceberg or Delta, statistics retrieval can be expensive process (eg in Iceberg case may involve traversing manifest files). for trivial `SELECT .. FROM t [LIMIT n]` queries it is not needed (no logic is stats dependent). for complex queries, it should be run _after_ predicate pushdown, thus allowing for cheaper (and more accurae) stats retrieval. - decoupled. the plan creating and attributes derivation should not be coupled, since they are two different parts of the system. - backwards compatibility. the LogicalPlan nodes are datafusion's frontend (SQL frontend, dataframe's, or created separate by downstream projects). breaking this layer can have dire consequences. proving a way to easily construct LogicalPlan nodes without attributes needed later on could turn the effort void. Those observations lead to conclusions on the design what should and shouldn't be part of a plan (logical or physical): - there are types of additional information that could (or maybe even should) be added to the plan (https://github.com/apache/datafusion/issues/12604). - there are types of additional information which feel more attributes of the optimizer itself -- from design perspective they should better be kept off the plan - more advanced optimizers may want to derive some additional information about the plan node, or group of plan nodes separately from plan creation (for example if an optimizer creates alternative LPs that compute same relation, they stats for that relation don't need to be computed more than once, even if we don't know the stats for intermediate nodes) - value-based immutable data structures is a really really good thing for engineering sanity. in short-term it prevents _adding_ information to the plan nodes during optimization process - but a more advanced optimizer could want to break plan immutability anyway, to support "plan groups" / alternative plans for given sub-relation, or simply to be able to move a sub-part of the plan into an optimizer rule, without need to destructure the whole plan To me, the best way to address these various needs would be - creation of a new IR plan structure - with new expressions to support types https://github.com/apache/datafusion/issues/12604 - with much less verbose expression language (like we don't need IS NULL vs IS UNKOWN) - with clear separation of sorting ()https://github.com/apache/datafusion/pull/13258), aliases (https://github.com/apache/datafusion/issues/1468), aggregate vs scalar functions, etc - remodelling aliases would also solve https://github.com/apache/datafusion/issues/6543 / https://github.com/apache/datafusion/issues/13476 (see discussion in https://github.com/apache/datafusion/pull/13489) - this feels key block of https://github.com/apache/datafusion/pull/13258 - creation of a new optimizer which - would optimize plans favoring optimization locallity (eg if one optimizer rule changed something around plan leaf node, the next optimizer rule should start at that place, rather than from the plan root, cause the rest of the plan was already optimized) - would allow tracking additional information about plan nodes or plan groups (Trino's Memo is a good analogy here, https://github.com/trinodb/trino/blob/f83854687229df1ae3325249e7bcd29f03d0943c/core/trino-main/src/main/java/io/trino/sql/planner/iterative/Memo.java#L247-L265) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org