Botong, this is very exciting. Thank you for your contribution to the community. I'm very interested in materialization and incremental computing. We know there are many challenges in this area, such as choosing a good execution plan, selecting good views to materialize, effectively matching queries against materialized views, and implementing domain-specific optimization rules, all of which can speed up query optimization. Thank you for sharing, and I look forward to more discussion and feedback.
ZhaoHui Xu

------------------ Original Message ------------------
From: "dev" <bot...@apache.org>
Date: December 24, 2020, 3:17
To: "dev" <dev@calcite.apache.org>; "zengkai.zk" <zengkai...@alibaba-inc.com>; "zuozhiw" <zuoz...@gmail.com>
Subject: Proposal to extend Calcite into an incremental query optimizer

Hi all,

This is a proposal to extend the Calcite optimizer into a general incremental query optimizer, based on our research paper published in VLDB 2021: "Tempura: A General Cost-Based Optimizer Framework for Incremental Data Processing". We also have a demo in SIGMOD 2020 illustrating how Alibaba's data warehouse is planning to use this incremental query optimizer to alleviate cluster-wide resource skewness: "Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental Computing".

To the best of our knowledge, this is the first general cost-based incremental optimizer that can find the best plan across multiple families of incremental computing methods, including IVM, streaming, DBToaster, etc. Experiments in the paper show that the generated best plan is consistently much better than the plans from each individual method alone. In general, incremental query planning is central to database view maintenance and stream processing systems, and is being adopted in active databases, resumable query execution, approximate query processing, etc. We hope this feature can help widen the spectrum of Calcite and solicit more use cases and adoption of Calcite.

Below is a brief description of the technical details. Please refer to the Tempura paper for more details. We are also working on a journal version of the paper with more implementation details.

Currently, the query plan generated by Calcite is meant to be executed all at once. In this proposal, Calcite's memo will be extended with temporal information so that it is capable of generating incremental plans that include multiple sub-plans to execute at different time points.
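As a toy illustration of what such a multi-time-point plan looks like (a hypothetical sketch, not Calcite code; all names here are made up for illustration), consider an aggregation split into sub-plans where each sub-plan, executed at its own time point, consumes only the data that arrived since the previous run:

```python
# Toy sketch (hypothetical, not Calcite code): an "incremental plan" as a
# sequence of sub-plans executed at different time points. Each sub-plan
# aggregates only the delta that arrived since the last run and folds it
# into a running state, so the final answer is ready at the last time
# point without reprocessing earlier data.

def run_incremental_plan(batches):
    """batches: list of (time_point, rows) pairs, rows = list of (key, value)."""
    state = {}  # running partial aggregate, carried across time points
    for t, rows in batches:
        # Sub-plan executed at time point t: aggregate only this delta.
        for key, value in rows:
            state[key] = state.get(key, 0) + value
    return state

# Three mini-batches arriving at t1 < t2 < t3.
batches = [
    ("t1", [("a", 1), ("b", 2)]),
    ("t2", [("a", 3)]),
    ("t3", [("b", 4), ("c", 5)]),
]
print(run_incremental_plan(batches))  # {'a': 4, 'b': 6, 'c': 5}
```

The optimizer's job, as described in the proposal, is to decide where to cut the plan into such sub-plans and which incremental method to use for each piece, based on cost.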
The main idea is to view each table as one that changes over time, i.e., a time-varying relation (TVR). To achieve that, we introduce TvrMetaSet into Calcite's memo, alongside RelSet and RelSubset, to track the related RelSets of a changing table (e.g., the snapshot of the table at a certain time, the delta of the table between two time points, etc.). For example, in the above figure, each vertical line is a TvrMetaSet representing a TVR (S, R, S left outer join R, etc.). Horizontal lines represent time. Each black dot in the grid is a RelSet. Users can write TVR rewrite rules to describe valid transformations between these dots. For example, the blue lines are inter-TVR rules that describe how to compute certain RelSets of a TVR from RelSets of other TVRs. The red lines are intra-TVR rules that describe transformations within a TVR. All TVR rewrite rules are logical rules, and all existing Calcite rules still work in the new volcano system without modification.

The changes for this feature consist of four parts:
1. Memo extension with TvrMetaSet.
2. Rule engine upgrade, capable of matching TvrMetaSets and RelNodes, as well as the links between the nodes.
3. A basic set of TvrRules, written using the upgraded rule engine API.
4. Multi-query optimization, used to find the best incremental plan involving multiple time points.

Note that this feature is an extension in nature, and thus, when disabled, it does not change any existing Calcite behavior. Beyond the scenarios in the paper, we also applied this extended Calcite incremental query optimizer to a type of periodic query called the "range query" in Alibaba's data warehouse. It achieved cost savings of 80% on total CPU and memory consumption, and 60% on end-to-end execution time.

All comments and suggestions are welcome. Thanks and happy holidays!

Best,
Botong
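To make the snapshot/delta vocabulary from the proposal concrete, here is a toy sketch (hypothetical names, not the proposed Calcite API) of one intra-TVR relationship: the snapshot of a TVR at time t2 equals its snapshot at t1 merged with the delta accumulated between t1 and t2, where a delta records per-row multiplicity changes:

```python
from collections import Counter

# Toy sketch (hypothetical, not the proposed Calcite API) of an intra-TVR
# identity: snapshot(t2) = snapshot(t1) merged with delta(t1, t2).
# Relations are modeled as multisets; a delta maps each row to a
# multiplicity change (positive = insert, negative = delete).

def apply_delta(snapshot, delta):
    """Merge a delta (Counter of row -> multiplicity change) into a snapshot."""
    result = Counter(snapshot)
    result.update(delta)  # Counter.update adds counts, including negatives
    # Drop rows whose multiplicity fell to zero or below.
    return Counter({row: n for row, n in result.items() if n > 0})

snapshot_t1 = Counter({("r1",): 1, ("r2",): 2})
delta_t1_t2 = Counter({("r3",): 1, ("r2",): -1})  # insert r3, delete one r2

snapshot_t2 = apply_delta(snapshot_t1, delta_t1_t2)
# r2's multiplicity drops from 2 to 1, and r3 appears with multiplicity 1.
```

An intra-TVR rewrite rule in the proposed framework would, in spirit, record that the RelSet for snapshot(t2) can be derived from the RelSets for snapshot(t1) and delta(t1, t2) via such a merge operator, letting the cost-based optimizer choose between this incremental path and a full recomputation.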