findepi commented on PR #13618:
URL: https://github.com/apache/datafusion/pull/13618#issuecomment-2511141746

   > `TreeNode` traversal APIs are crucial parts of query plan analysis and 
optimzation. This PR explores the idea of storing some pre-calculatd 
statistics/properties nodes (or subtrees) inside the nodes during creation and 
automatically update the stats during transfomations.
   
   I love the idea in principle (vide 
https://github.com/apache/datafusion/issues/12604)
   
   There are however some important questions
   
   - does the party creating given plan component know all the values we could 
want to eventually infer from the plan?
     - statistics is a great example. statistics retrieval for a  table scan  
can be expensive. statistics inference for a filter or join is actually 
complicated. the algorithm for doing that is not part of the plan, and the call 
site creating the plan shouldn't need to know what the algorithm is
     - lazy vs eager. statistics is a great example again. In data lake systems 
like Iceberg or Delta, statistics retrieval can be expensive process (eg in 
Iceberg case may involve traversing manifest files). for trivial `SELECT .. 
FROM t [LIMIT n]` queries it is not needed (no logic is stats dependent). for 
complex queries, it should be run _after_ predicate pushdown, thus allowing for 
cheaper (and more accurae) stats retrieval.
     - decoupled. the plan creating and attributes derivation should not be 
coupled, since they are two different parts of the system.
   - backwards compatibility. the LogicalPlan nodes are datafusion's frontend 
(SQL frontend, dataframe's, or created separate by downstream projects). 
breaking this layer can have dire consequences. proving a way to easily 
construct LogicalPlan nodes without attributes needed later on could turn the 
effort void.
   
   Those observations lead to conclusions on the design what should and 
shouldn't be part of a plan (logical or physical):
   
   - there are types of additional information that could (or maybe even 
should) be added to the plan 
(https://github.com/apache/datafusion/issues/12604).
   - there are types of additional information which feel more attributes of 
the optimizer itself -- from design perspective they should better be kept off 
the plan
   - more advanced optimizers may want to derive some additional information 
about the plan node, or group of plan nodes separately from plan creation (for 
example if an optimizer creates alternative LPs that compute same relation, 
they stats for that relation don't need to be computed more than once, even if 
we don't know the stats for intermediate nodes)
   - value-based immutable data structures is a really really good thing for 
engineering sanity. in short-term it prevents _adding_ information to the plan 
nodes during optimization process
     - but a more advanced optimizer could want to break plan immutability 
anyway, to support "plan groups" / alternative plans for given sub-relation, or 
simply to be able to move a sub-part of the plan into an optimizer rule, 
without need to destructure the whole plan
   
   To me, the best way to address these various needs would be
   
   - creation of a new IR plan structure
     - with new expressions to support types 
https://github.com/apache/datafusion/issues/12604
     - with much less verbose expression language (like we don't need IS NULL 
vs IS UNKOWN)
     - with clear separation of sorting 
()https://github.com/apache/datafusion/pull/13258), aliases 
(https://github.com/apache/datafusion/issues/1468), aggregate vs scalar 
functions,  etc
     - remodelling aliases would also solve 
https://github.com/apache/datafusion/issues/6543 / 
https://github.com/apache/datafusion/issues/13476 (see discussion in 
https://github.com/apache/datafusion/pull/13489)
     - this feels key block of https://github.com/apache/datafusion/pull/13258
   
   - creation of a new optimizer which
     - would optimize plans favoring optimization locallity (eg if one 
optimizer rule changed something around plan leaf node, the next optimizer rule 
should start at that place, rather than from the plan root, cause the rest of 
the plan was already optimized)
     - would allow tracking additional information about plan nodes or plan 
groups (Trino's Memo is a good analogy here, 
https://github.com/trinodb/trino/blob/f83854687229df1ae3325249e7bcd29f03d0943c/core/trino-main/src/main/java/io/trino/sql/planner/iterative/Memo.java#L247-L265)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to