This would be an exciting extension. I have experience using Calcite as a backend for a visualization platform, and we had to develop similar functionality on top, as it was essential for many optimizations we had in mind.
Here's a quick outline of our use cases: - Re-use of intermediate results. Two flavors here: -- A common part of a request is extracted, and the data is cached in the scope of one query. This could have probably been implemented using spool, but we did it using a custom optimization program. -- One query is pushed down into the DB, and then another query is run on top of that result and produces the final resultset. This was probably the most frequently used optimization. The first query did partial aggregation in the DB, which was very beneficial for performance. - Using the results of one query in another query. There were numerous posts on the mailing list about building a dynamic filter for pushdown. We did this by simply running several queries sequentially. For us, the benefit of offloading both queries to the DB outweighed the drawback of possibly scanning the same data twice. - Re-using intermediate result(again), but returning several resultsets. Doing several levels of aggregation on top of raw data(or the smallest aggregation level). - Running several independent queries on top of the same dataset. It is probably not much of an optimization from the DB standpoint (if we ignore all the smart multi-query optimizations, which I think not too many DBs actually currently have). However, it had some minor communication-level improvements for us as a data layer of the product. This was all implemented on top of Calcite, by building rel trees of different complexity. Some of that logic was abstracted into our "Statement" object; others were just pieces of logic during Rel tree building. But I think most of those optimizations could be implemented in Calcite, if there are enough extension points to plug in custom logic. There's probably nothing new here, but I hope this confirms such functionality's usefulness. Would be happy to help with the design/implementation. On Thu, Jan 4, 2024 at 10:47 PM Julian Hyde <[email protected]> wrote: > I didn’t actually have an algorithm in mind. I wanted to kick off a > research project, by saying ’This is a common problem, and here is a > language for describing it. Does anyone have ideas for how to solve it?’. > > The issue keeps coming up, and I want to be able to say ‘Yes, we’ve been > thinking about that in Calcite’. > > So, I’ll go ahead and log that Jira case. I’d welcome improvements — > better SQL syntax, better set of use cases — from people who have been > working in this space. > > I hadn’t thought about field trimming across trees, but that sounds very > important and useful. > > I suspect that the Volcano approach will get us 80% of the way there. > Dynamic programming doesn’t really care whether we are working on a tree, a > forest, or a collection of trees loosely connected into a DAG. The only > problem — as noted in https://issues.apache.org/jira/browse/CALCITE-481 — > is that Volcano’s cost model will tend to double-count. > > But my goal isn’t to solve the problem. Once we have a good set of use > cases (pure query, pure DML, hybrid query+DML, cyclic data flows) expressed > in SQL, we can start discussing algorithms. > > > > On Jan 3, 2024, at 9:25 PM, Benchao Li <[email protected]> wrote: > > > > This is a very fundamental use case in Flink. And in Flink, we have a > > mechanism that breaks the DAG into sub trees[1] and then utilizes > > Calcite to do the optimization of each sub tree. This prevents many > > optimization chances across sub trees, such as field trimming, filter > > pushdown, and many more transposing rules to utilize cost-based > > optimization. > > > > The benefit we get via that way is that we can avoid many complex > > optimization corner cases since sub trees are simpler than the whole > > DAG, we even hacked to utilize this non formal character to work > > around some planner corner cases (could not produce a valid plan / > > optimization dead loops) temporarily before we finally solved the root > > cause. > > > > Solving this problem in Calcite sounds good to me, the only one > > concern is the optimization complexity (I'm not sure how you planned > > to do it yet, it's just a heads up from my experience). I know it's > > more related to how users choose the rule sets and the complexity of > > queries, however it may be one of the factors that affects the > > usability of multi-query optimization. > > > > [1] > https://github.com/apache/flink/blob/8e8e2d649e98071bc15ee768cf65da3e96f255b4/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/optimize/CommonSubGraphBasedOptimizer.scala#L58 > > > > Forward Xu <[email protected]> 于2024年1月4日周四 10:51写道: > >> > >> Thanks to the community for the related work that has been done, This > is a > >> good discussion, and I think it can be used in, for example, better > >> multi-query optimization algorithms, more effective query result caching > >> strategies, smarter materialized view selection and maintenance methods, > >> etc. These directions require further research and exploration. > >> > >> Finally here is a related paper MULTI-QUERY OPTIMIZATION AND > APPLICATIONS] > >> < > https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2015/2014/Papers/prasan-thesis.pdf > > > >> . > >> > >> Best, > >> ForwardXu > >> > >> Julian Hyde <[email protected]> 于2024年1月4日周四 04:09写道: > >> > >>> Multi-query optimization is one of the big challenges of our field. > >>> Examples of multi-queries: > >>> * an INSERT statement that writes into a table but also updates an > index, > >>> * a DAG that represents an ETL/ELT job; > >>> * a query that produces several data sets (say a list of invoices for > >>> orders and a list of products that need to be restocked); > >>> * a query that uses intermediate results more than once. > >>> > >>> Common features of multi-queries are multiple output data sets, and > >>> re-used intermediate results. In other words, the dataflow graph is a > >>> DAG rather than just a tree. > >>> > >>> I think it would be useful to frame the problem by extending SQL so > >>> that such multi-queries can be represented as a single unit. From that > >>> would follow extensions to relational algebra, and improvements to > >>> planner algorithms and cost models. > >>> > >>> I would like to hear people's thoughts before I log a Jira case with a > >>> sketch of the problem. > >>> > >>> Related work: > >>> * https://issues.apache.org/jira/browse/CALCITE-481 Spool operator > >>> * https://issues.apache.org/jira/browse/CALCITE-1440 Multiple RelNodes > >>> * https://issues.apache.org/jira/browse/CALCITE-4568 Incremental > >>> query optimization > >>> * https://issues.apache.org/jira/browse/CALCITE-129 Recursive queries > >>> > >>> Julian > >>> > > > > > > > > -- > > > > Best, > > Benchao Li > >
