This is a very fundamental use case in Flink. And in Flink, we have a
mechanism that breaks the DAG into sub trees[1] and then utilizes
Calcite to do the optimization of each sub tree. This prevents many
optimization chances across sub trees, such as field trimming, filter
pushdown, and many more transposing rules to utilize cost-based
optimization.

The benefit we get via that way is that we can avoid many complex
optimization corner cases since sub trees are simpler than the whole
DAG, we even hacked to utilize this non formal character to work
around some planner corner cases (could not produce a valid plan /
optimization dead loops) temporarily before we finally solved the root
cause.

Solving this problem in Calcite sounds good to me, the only one
concern is the optimization complexity (I'm not sure how you planned
to do it yet, it's just a heads up from my experience). I know it's
more related to how users choose the rule sets and the complexity of
queries, however it may be one of the factors that affects the
usability of multi-query optimization.

[1] 
https://github.com/apache/flink/blob/8e8e2d649e98071bc15ee768cf65da3e96f255b4/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/optimize/CommonSubGraphBasedOptimizer.scala#L58

Forward Xu <[email protected]> 于2024年1月4日周四 10:51写道:
>
> Thanks to the community for the related work that has been done, This is a
> good discussion,  and I think it can be used in, for example, better
> multi-query optimization algorithms, more effective query result caching
> strategies, smarter materialized view selection and maintenance methods,
> etc. These directions require further research and exploration.
>
> Finally here is a related paper MULTI-QUERY OPTIMIZATION AND APPLICATIONS]
> <https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2015/2014/Papers/prasan-thesis.pdf>
> .
>
> Best,
> ForwardXu
>
> Julian Hyde <[email protected]> 于2024年1月4日周四 04:09写道:
>
> > Multi-query optimization is one of the big challenges of our field.
> > Examples of multi-queries:
> >  * an INSERT statement that writes into a table but also updates an index,
> >  * a DAG that represents an ETL/ELT job;
> >  * a query that produces several data sets (say a list of invoices for
> > orders and a list of products that need to be restocked);
> >  * a query that uses intermediate results more than once.
> >
> > Common features of multi-queries are multiple output data sets, and
> > re-used intermediate results. In other words, the dataflow graph is a
> > DAG rather than just a tree.
> >
> > I think it would be useful to frame the problem by extending SQL so
> > that such multi-queries can be represented as a single unit. From that
> > would follow extensions to relational algebra, and improvements to
> > planner algorithms and cost models.
> >
> > I would like to hear people's thoughts before I log a Jira case with a
> > sketch of the problem.
> >
> > Related work:
> >  * https://issues.apache.org/jira/browse/CALCITE-481 Spool operator
> >  * https://issues.apache.org/jira/browse/CALCITE-1440 Multiple RelNodes
> >  * https://issues.apache.org/jira/browse/CALCITE-4568 Incremental
> > query optimization
> >  * https://issues.apache.org/jira/browse/CALCITE-129 Recursive queries
> >
> > Julian
> >



-- 

Best,
Benchao Li

Reply via email to