Botong,
This is very exciting. Thank you for your contribution to the community. I'm
very interested in materialization and incremental computing. We know there are
many challenges in this area, such as choosing a good execution plan, selecting
good views to materialize, effectively matching queries against materialized
views, and implementing optimization rules for specific domains. This work can
speed up query optimization. Thank you for sharing, and I look forward to more
discussion and feedback.
ZhaoHui Xu
------------------ Original Message ------------------
From: "dev" <[email protected]>
Date: Thursday, Dec 24, 2020, 3:17 PM
To: "dev" <[email protected]>; "zengkai.zk" <[email protected]>; "zuozhiw" <[email protected]>
Subject: Proposal to extend Calcite into an incremental query optimizer
Hi all,
This is a proposal to extend the Calcite optimizer into a general incremental
query optimizer, based on our research paper published in VLDB 2021:
Tempura: a general cost-based optimizer framework for incremental data
processing
We also have a demo in SIGMOD 2020 illustrating how Alibaba's data warehouse
is planning to use this incremental query optimizer to alleviate cluster-wise
resource skewness:
Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental Computing
To the best of our knowledge, this is the first general cost-based incremental
optimizer that can find the best plan across multiple families of incremental
computing methods, including IVM, streaming, DBToaster, etc. Experiments in the
paper show that the generated best plan is consistently much better than the
plans from each individual method alone.
In general, incremental query planning is central to database view maintenance
and stream processing systems, and is being adopted in active databases,
resumable query execution, approximate query processing, etc. We hope that this
feature can help widen the spectrum of Calcite and solicit more use cases and
adoption of Calcite.
Below is a brief description of the technical details. Please refer to the
Tempura paper for more details. We are also working on a journal version of the
paper with more implementation details.
Currently, the query plan generated by Calcite is meant to be executed all at
once. In this proposal, Calcite's memo will be extended with temporal
information so that it is capable of generating incremental plans that include
multiple sub-plans to execute at different time points.
The main idea is to view each table as one that changes over time, i.e., a
Time-Varying Relation (TVR). To achieve that, we introduce a TvrMetaSet into
Calcite's memo, alongside RelSet and RelSubset, to track the related RelSets of
a changing table (e.g., the snapshot of the table at a certain time, or the
delta of the table between two time points).
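To make the snapshot/delta relationship concrete, here is a minimal,
self-contained sketch (plain Java collections, not the Calcite API or the
actual TvrMetaSet implementation): a snapshot is a multiset of rows, a delta
maps rows to multiplicity changes, and applying the delta to the snapshot at t1
yields the snapshot at t2.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: one intra-TVR identity is
//   snapshot(t2) = snapshot(t1) combined with delta(t1, t2),
// where a delta maps each row to a +/- multiplicity change.
public class TvrSketch {
    // Apply a delta (row -> count change) to a snapshot (row -> count).
    public static Map<String, Integer> applyDelta(Map<String, Integer> snapshot,
                                                  Map<String, Integer> delta) {
        Map<String, Integer> result = new HashMap<>(snapshot);
        delta.forEach((row, change) -> result.merge(row, change, Integer::sum));
        result.values().removeIf(count -> count == 0); // drop fully deleted rows
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> snapshotT1 = Map.of("a", 1, "b", 2);
        Map<String, Integer> delta = Map.of("b", -1, "c", 1); // delete one 'b', insert one 'c'
        System.out.println(applyDelta(snapshotT1, delta)); // snapshot at t2
    }
}
```

In the actual memo, each of these states (snapshot at t1, delta over (t1, t2),
snapshot at t2) would be tracked as a separate RelSet linked under one
TvrMetaSet.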
For example, in the figure above, each vertical line is a TvrMetaSet
representing a TVR (S, R, S left outer join R, etc.), and the horizontal lines
represent time. Each black dot in the grid is a RelSet. Users can write TVR
rewrite rules to describe valid transformations between these dots. For
example, the blue lines are inter-TVR rules that describe how to compute a
certain RelSet of a TVR from RelSets of other TVRs, while the red lines are
intra-TVR rules that describe transformations within a single TVR. All TVR
rewrite rules are logical rules, and all existing Calcite rules still work in
the new volcano system without modification.
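As one example of the kind of transformation an inter-TVR rule can encode,
consider the classic incremental-join identity for insert-only deltas:
delta(S join R) = (dS join R_old) + (S_old join dR) + (dS join dR). The sketch
below checks this identity on plain Java sets; it is an illustration of the
algebra, not the Calcite rule API.

```java
import java.util.*;

// Illustrative sketch only: the incremental-join identity for
// insert-only deltas, stated over sets of (key, payload) rows.
public class IncrementalJoin {
    record Row(String key, String payload) {}

    // Natural join on equal keys; each match yields a (left, right) pair.
    static Set<List<Row>> join(Set<Row> left, Set<Row> right) {
        Set<List<Row>> out = new HashSet<>();
        for (Row l : left)
            for (Row r : right)
                if (l.key().equals(r.key())) out.add(List.of(l, r));
        return out;
    }

    // Delta of the join, computed from old snapshots and input deltas only.
    static Set<List<Row>> joinDelta(Set<Row> sOld, Set<Row> dS,
                                    Set<Row> rOld, Set<Row> dR) {
        Set<List<Row>> delta = new HashSet<>(join(dS, rOld));
        delta.addAll(join(sOld, dR));
        delta.addAll(join(dS, dR));
        return delta;
    }

    public static void main(String[] args) {
        Set<Row> sOld = Set.of(new Row("k1", "s1"));
        Set<Row> dS   = Set.of(new Row("k2", "s2"));
        Set<Row> rOld = Set.of(new Row("k2", "r1"));
        Set<Row> dR   = Set.of(new Row("k1", "r2"));

        // Recomputing the join from the full new snapshots...
        Set<Row> sNew = new HashSet<>(sOld); sNew.addAll(dS);
        Set<Row> rNew = new HashSet<>(rOld); rNew.addAll(dR);
        Set<List<Row>> full = join(sNew, rNew);

        // ...equals the old join plus the incrementally computed delta.
        Set<List<Row>> incremental = new HashSet<>(join(sOld, rOld));
        incremental.addAll(joinDelta(sOld, dS, rOld, dR));
        System.out.println(full.equals(incremental)); // true
    }
}
```

A TVR rewrite rule would register such an identity once, and the cost-based
search would then decide whether the incremental form is cheaper than full
recomputation at each time point.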
All changes in this feature will consist of four parts:
1. Memo extension with TvrMetaSet
2. Rule engine upgrade, capable of matching TvrMetaSets and RelNodes, as well
as the links between them.
3. A basic set of TvrRules, written using the upgraded rule engine API.
4. Multi-query optimization, used to find the best incremental plan involving
multiple time points.
Note that this feature is an extension in nature and thus when disabled, does
not change any existing Calcite behavior.
Beyond the scenarios in the paper, we also applied this extended incremental
query optimizer to a type of periodic query called the "range query" in
Alibaba's data warehouse. It achieved cost savings of 80% on total CPU and
memory consumption, and 60% on end-to-end execution time.
All comments and suggestions are welcome. Thanks and happy holidays!
Best,
Botong