Hi all,

This is a proposal to extend the Calcite optimizer into a general
incremental query optimizer, based on our research paper published in VLDB
2021:
Tempura: a general cost-based optimizer framework for incremental data
processing

We also have a demo in SIGMOD 2020 illustrating how Alibaba’s data
warehouse is planning to use this incremental query optimizer to alleviate
cluster-wise resource skewness:
Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental Computing

To our best knowledge, this is the first general cost-based incremental
optimizer that can find the best plan across multiple families of
incremental computing methods, including IVM, Streaming, DBToaster, etc.
Experiments (in the paper) shows that the generated best plan is
consistently much better than the plans from each individual method alone.

In general, incremental query planning is central to database view
maintenance and stream processing systems, and are being adopted in active
databases, resumable query execution, approximate query processing, etc. We
are hoping that this feature can help widening the spectrum of Calcite,
solicit more use cases and adoption of Calcite.

Below is a brief description of the technical details. Please refer to the
Tempura paper for more details. We are also working on a journal version of
the paper with more implementation details.

Currently the query plan generated by Calcite is meant to be executed
altogether at once. In the proposal, Calcite’s memo will be extended with
temporal information so that it is capable of generating incremental plans
that include multiple sub-plans to execute at different time points.

The main idea is to view each table as one that changes over time (Time
Varying Relations (TVR)). To achieve that we introduced TvrMetaSet into
Calcite’s memo besides RelSet and RelSubset to track related RelSets of a
changing table (e.g. snapshot of the table at certain time, delta of the
table between two time points, etc.).

[image: image.png]

For example in the above figure, each vertical line is a TvrMetaSet
representing a TVR (S, R, S left outer join R, etc.). Horizontal lines
represent time. Each black dot in the grid is a RelSet. Users can write TVR
Rewrite Rules to describe valid transformations between these dots. For
example, the blues lines are inter-TVR rules that describe how to compute
certain RelSet of a TVR from RelSets of other TVRs. The red lines are
intra-TVR rules that describe transformations within a TVR. All TVR rewrite
rules are logical rules. All existing Calcite rules still work in the new
volcano system without modification.

All changes in this feature will consist of four parts:
1. Memo extension with TvrMetaSet
2. Rule engine upgrade, capable of matching TvrMetaSet and RelNodes, as
well as links in between the nodes.
3. A basic set of TvrRules, written using the upgraded rule engine API.
4. Multi-query optimization, used to find the best incremental plan
involving multiple time points.

Note that this feature is an extension in nature and thus when disabled,
does not change any existing Calcite behavior.

Other than scenarios in the paper, we also applied this Calcite-extended
incremental query optimizer to a type of periodic query called the ‘‘range
query’’ in Alibaba’s data warehouse. It achieved cost savings of 80% on
total CPU and memory consumption, and 60% on end-to-end execution time.

All comments and suggestions are welcome. Thanks and happy holidays!

Best,
Botong

Reply via email to