I didn’t actually have an algorithm in mind. I wanted to kick off a research project, by saying ’This is a common problem, and here is a language for describing it. Does anyone have ideas for how to solve it?’.
The issue keeps coming up, and I want to be able to say ‘Yes, we’ve been thinking about that in Calcite’. So, I’ll go ahead and log that Jira case. I’d welcome improvements — better SQL syntax, better set of use cases — from people who have been working in this space. I hadn’t thought about field trimming across trees, but that sounds very important and useful. I suspect that the Volcano approach will get us 80% of the way there. Dynamic programming doesn’t really care whether we are working on a tree, a forest, or a collection of trees loosely connected into a DAG. The only problem — as noted in https://issues.apache.org/jira/browse/CALCITE-481 — is that Volcano’s cost model will tend to double-count. But my goal isn’t to solve the problem. Once we have a good set of use cases (pure query, pure DML, hybrid query+DML, cyclic data flows) expressed in SQL, we can start discussing algorithms. > On Jan 3, 2024, at 9:25 PM, Benchao Li <[email protected]> wrote: > > This is a very fundamental use case in Flink. And in Flink, we have a > mechanism that breaks the DAG into sub trees[1] and then utilizes > Calcite to do the optimization of each sub tree. This prevents many > optimization chances across sub trees, such as field trimming, filter > pushdown, and many more transposing rules to utilize cost-based > optimization. > > The benefit we get via that way is that we can avoid many complex > optimization corner cases since sub trees are simpler than the whole > DAG, we even hacked to utilize this non formal character to work > around some planner corner cases (could not produce a valid plan / > optimization dead loops) temporarily before we finally solved the root > cause. > > Solving this problem in Calcite sounds good to me, the only one > concern is the optimization complexity (I'm not sure how you planned > to do it yet, it's just a heads up from my experience). I know it's > more related to how users choose the rule sets and the complexity of > queries, however it may be one of the factors that affects the > usability of multi-query optimization. > > [1] > https://github.com/apache/flink/blob/8e8e2d649e98071bc15ee768cf65da3e96f255b4/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/optimize/CommonSubGraphBasedOptimizer.scala#L58 > > Forward Xu <[email protected]> 于2024年1月4日周四 10:51写道: >> >> Thanks to the community for the related work that has been done, This is a >> good discussion, and I think it can be used in, for example, better >> multi-query optimization algorithms, more effective query result caching >> strategies, smarter materialized view selection and maintenance methods, >> etc. These directions require further research and exploration. >> >> Finally here is a related paper MULTI-QUERY OPTIMIZATION AND APPLICATIONS] >> <https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2015/2014/Papers/prasan-thesis.pdf> >> . >> >> Best, >> ForwardXu >> >> Julian Hyde <[email protected]> 于2024年1月4日周四 04:09写道: >> >>> Multi-query optimization is one of the big challenges of our field. >>> Examples of multi-queries: >>> * an INSERT statement that writes into a table but also updates an index, >>> * a DAG that represents an ETL/ELT job; >>> * a query that produces several data sets (say a list of invoices for >>> orders and a list of products that need to be restocked); >>> * a query that uses intermediate results more than once. >>> >>> Common features of multi-queries are multiple output data sets, and >>> re-used intermediate results. In other words, the dataflow graph is a >>> DAG rather than just a tree. >>> >>> I think it would be useful to frame the problem by extending SQL so >>> that such multi-queries can be represented as a single unit. From that >>> would follow extensions to relational algebra, and improvements to >>> planner algorithms and cost models. >>> >>> I would like to hear people's thoughts before I log a Jira case with a >>> sketch of the problem. >>> >>> Related work: >>> * https://issues.apache.org/jira/browse/CALCITE-481 Spool operator >>> * https://issues.apache.org/jira/browse/CALCITE-1440 Multiple RelNodes >>> * https://issues.apache.org/jira/browse/CALCITE-4568 Incremental >>> query optimization >>> * https://issues.apache.org/jira/browse/CALCITE-129 Recursive queries >>> >>> Julian >>> > > > > -- > > Best, > Benchao Li
