Re: Multi-query optimization

Julian Hyde Thu, 04 Jan 2024 12:47:51 -0800

I didn’t actually have an algorithm in mind. I wanted to kick off a research 
project, by saying ’This is a common problem, and here is a language for 
describing it. Does anyone have ideas for how to solve it?’.


The issue keeps coming up, and I want to be able to say ‘Yes, we’ve been 
thinking about that in Calcite’.

So, I’ll go ahead and log that Jira case. I’d welcome improvements — better SQL 
syntax, better set of use cases — from people who have been working in this 
space.

I hadn’t thought about field trimming across trees, but that sounds very 
important and useful.

I suspect that the Volcano approach will get us 80% of the way there. Dynamic 
programming doesn’t really care whether we are working on a tree, a forest, or 
a collection of trees loosely connected into a DAG. The only problem — as noted 
in https://issues.apache.org/jira/browse/CALCITE-481 — is that Volcano’s cost 
model will tend to double-count.

But my goal isn’t to solve the problem. Once we have a good set of use cases 
(pure query, pure DML, hybrid query+DML, cyclic data flows) expressed in SQL, 
we can start discussing algorithms.


> On Jan 3, 2024, at 9:25 PM, Benchao Li <[email protected]> wrote:
> 
> This is a very fundamental use case in Flink. And in Flink, we have a
> mechanism that breaks the DAG into sub trees[1] and then utilizes
> Calcite to do the optimization of each sub tree. This prevents many
> optimization chances across sub trees, such as field trimming, filter
> pushdown, and many more transposing rules to utilize cost-based
> optimization.
> 
> The benefit we get via that way is that we can avoid many complex
> optimization corner cases since sub trees are simpler than the whole
> DAG, we even hacked to utilize this non formal character to work
> around some planner corner cases (could not produce a valid plan /
> optimization dead loops) temporarily before we finally solved the root
> cause.
> 
> Solving this problem in Calcite sounds good to me, the only one
> concern is the optimization complexity (I'm not sure how you planned
> to do it yet, it's just a heads up from my experience). I know it's
> more related to how users choose the rule sets and the complexity of
> queries, however it may be one of the factors that affects the
> usability of multi-query optimization.
> 
> [1] 
> https://github.com/apache/flink/blob/8e8e2d649e98071bc15ee768cf65da3e96f255b4/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/optimize/CommonSubGraphBasedOptimizer.scala#L58
> 
> Forward Xu <[email protected]> 于2024年1月4日周四 10:51写道：
>> 
>> Thanks to the community for the related work that has been done, This is a
>> good discussion,  and I think it can be used in, for example, better
>> multi-query optimization algorithms, more effective query result caching
>> strategies, smarter materialized view selection and maintenance methods,
>> etc. These directions require further research and exploration.
>> 
>> Finally here is a related paper MULTI-QUERY OPTIMIZATION AND APPLICATIONS]
>> <https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2015/2014/Papers/prasan-thesis.pdf>
>> .
>> 
>> Best，
>> ForwardXu
>> 
>> Julian Hyde <[email protected]> 于2024年1月4日周四 04:09写道：
>> 
>>> Multi-query optimization is one of the big challenges of our field.
>>> Examples of multi-queries:
>>> * an INSERT statement that writes into a table but also updates an index,
>>> * a DAG that represents an ETL/ELT job;
>>> * a query that produces several data sets (say a list of invoices for
>>> orders and a list of products that need to be restocked);
>>> * a query that uses intermediate results more than once.
>>> 
>>> Common features of multi-queries are multiple output data sets, and
>>> re-used intermediate results. In other words, the dataflow graph is a
>>> DAG rather than just a tree.
>>> 
>>> I think it would be useful to frame the problem by extending SQL so
>>> that such multi-queries can be represented as a single unit. From that
>>> would follow extensions to relational algebra, and improvements to
>>> planner algorithms and cost models.
>>> 
>>> I would like to hear people's thoughts before I log a Jira case with a
>>> sketch of the problem.
>>> 
>>> Related work:
>>> * https://issues.apache.org/jira/browse/CALCITE-481 Spool operator
>>> * https://issues.apache.org/jira/browse/CALCITE-1440 Multiple RelNodes
>>> * https://issues.apache.org/jira/browse/CALCITE-4568 Incremental
>>> query optimization
>>> * https://issues.apache.org/jira/browse/CALCITE-129 Recursive queries
>>> 
>>> Julian
>>> 
> 
> 
> 
> -- 
> 
> Best,
> Benchao Li

Re: Multi-query optimization

Reply via email to