Re: Proposal to extend Calcite into a incremental query optimizer

Botong Huang Wed, 27 Jan 2021 18:21:20 -0800

Hi Julian,

Just wondering if there are any updates? We are wondering if it would help
to post our code for a quick preview.


Thanks,
Botong

On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> wrote:

> Hi Julian,
>
> Thanks for your interest! Sure let's figure out a plan that best benefits
> the community. Here are some clarifications that hopefully answer your
> questions.
>
> In our work (Tempura), users specify the set of time points to consider
> running and a cost function that expresses users' preference over time,
> Tempura will generate the best incremental plan that minimizes the overall
> cost function.
>
> In this incremental plan, the sub-plans at different time points can be
> different from each other, as opposed to identical plans in all delta runs
> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we can
> mimic the current streaming implementation by specifying two (logical) time
> points in Tempura, representing the initial run and later delta runs
> respectively. In general, note that Tempura supports various form of
> incremental computing, not only the small-delta append-only data model in
> streaming systems. That's why we believe Tempura subsumes the current
> streaming support, as well as any IVM implementations.
>
> About the cost model, we did not come up with a seperate cost model, but
> rather extended the existing one. Similar to multi-objective optimization,
> costs incurred at different time points are considered different
> dimensions. Tempura lets users supply a function that converts this cost
> vector into a final cost. So under this function, any two incremental plans
> are still comparable and there is an overall optimum. I guess we can go
> down the route of multi-objective parametric query optimization instead if
> there is a need.
>
> Next on materialized views and multi-query optimization, since our
> multi-time-point plan naturally involves materializing intermediate results
> for later time points, we need to solve the problem of choosing
> materializations and include the cost of saving and reusing the
> materializations when costing and comparing plans. We borrowed the
> multi-query optimization techniques to solve this problem even though we
> are looking at a single query. As a result, we think our work is orthogonal
> to Calcite's facilities around utilizing existing views, lattice etc. We do
> feel that the multi-query optimization component can be adopted to wider
> use, but probably need more suggestions from the community.
>
> Lastly, our current implementation is set up in java code, it should be
> straightforward to hook it up with SQL shell.
>
> Thanks,
> Botong
>
> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]>
> wrote:
>
>> Botong,
>>
>> This is very exciting; congratulations on this research, and thank you
>> for contributing it back to Calcite.
>>
>> The research touches several areas in Calcite: streaming, materialized
>> view maintenance, and multi-query optimization. As we have already some
>> solutions in those areas (Sigma and Delta relational operators, lattice,
>> and Spool operator), it will be interesting to see whether we can make them
>> compatible, or whether one concept can subsume others.
>>
>> Your work differs from streaming queries in that your relations are used
>> by “external” user queries, whereas in pure streaming queries, the only
>> activity is the change propagation. Did you find that you needed two
>> separate cost models - one for “view maintenance” and another for “user
>> queries” - since the objectives of each activity are so different?
>>
>> I wonder whether this work will hasten the arrival of multi-objective
>> parametric query optimization [1] in Calcite.
>>
>> I will make time over the next few days to read and digest your paper.
>> Then I expect that we will have a back-and-forth process to create
>> something that will be useful for the broader community.
>>
>> One thing will be particularly useful: making this functionality
>> available from a SQL shell, so that people can experiment with this
>> functionality without writing Java code or setting up complex databases and
>> metadata. I have in mind something like the simple DDL operations that are
>> available in Calcite’s ’server’ module. I wonder whether we could devise
>> some kind of SQL syntax for a “multi-query”.
>>
>> Julian
>>
>> [1]
>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>>
>>
>>
>> > On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> wrote:
>> >
>> > Thanks Aron for pointing this out. To see the figure, please refer to
>> Fig
>> > 3(a) in our paper:
>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
>> >
>> > Best,
>> > Botong
>> >
>> > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]> wrote:
>> >
>> >> Seems interesting, the pic can not be seen in the mail, may you open a
>> JIRA
>> >> for this, people who are interested in this can subscribe to the JIRA?
>> >>
>> >>
>> >> Regards!
>> >>
>> >> Aron Tao
>> >>
>> >>
>> >> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
>> >>
>> >>> Hi all,
>> >>>
>> >>> This is a proposal to extend the Calcite optimizer into a general
>> >>> incremental query optimizer, based on our research paper published in
>> >> VLDB
>> >>> 2021:
>> >>> Tempura: a general cost-based optimizer framework for incremental data
>> >>> processing
>> >>>
>> >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s data
>> >>> warehouse is planning to use this incremental query optimizer to
>> >> alleviate
>> >>> cluster-wise resource skewness:
>> >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental
>> >> Computing
>> >>>
>> >>> To our best knowledge, this is the first general cost-based
>> incremental
>> >>> optimizer that can find the best plan across multiple families of
>> >>> incremental computing methods, including IVM, Streaming, DBToaster,
>> etc.
>> >>> Experiments (in the paper) shows that the generated best plan is
>> >>> consistently much better than the plans from each individual method
>> >> alone.
>> >>>
>> >>> In general, incremental query planning is central to database view
>> >>> maintenance and stream processing systems, and are being adopted in
>> >> active
>> >>> databases, resumable query execution, approximate query processing,
>> etc.
>> >> We
>> >>> are hoping that this feature can help widening the spectrum of
>> Calcite,
>> >>> solicit more use cases and adoption of Calcite.
>> >>>
>> >>> Below is a brief description of the technical details. Please refer to
>> >> the
>> >>> Tempura paper for more details. We are also working on a journal
>> version
>> >> of
>> >>> the paper with more implementation details.
>> >>>
>> >>> Currently the query plan generated by Calcite is meant to be executed
>> >>> altogether at once. In the proposal, Calcite’s memo will be extended
>> with
>> >>> temporal information so that it is capable of generating incremental
>> >> plans
>> >>> that include multiple sub-plans to execute at different time points.
>> >>>
>> >>> The main idea is to view each table as one that changes over time
>> (Time
>> >>> Varying Relations (TVR)). To achieve that we introduced TvrMetaSet
>> into
>> >>> Calcite’s memo besides RelSet and RelSubset to track related RelSets
>> of a
>> >>> changing table (e.g. snapshot of the table at certain time, delta of
>> the
>> >>> table between two time points, etc.).
>> >>>
>> >>> [image: image.png]
>> >>>
>> >>> For example in the above figure, each vertical line is a TvrMetaSet
>> >>> representing a TVR (S, R, S left outer join R, etc.). Horizontal lines
>> >>> represent time. Each black dot in the grid is a RelSet. Users can
>> write
>> >> TVR
>> >>> Rewrite Rules to describe valid transformations between these dots.
>> For
>> >>> example, the blues lines are inter-TVR rules that describe how to
>> compute
>> >>> certain RelSet of a TVR from RelSets of other TVRs. The red lines are
>> >>> intra-TVR rules that describe transformations within a TVR. All TVR
>> >> rewrite
>> >>> rules are logical rules. All existing Calcite rules still work in the
>> new
>> >>> volcano system without modification.
>> >>>
>> >>> All changes in this feature will consist of four parts:
>> >>> 1. Memo extension with TvrMetaSet
>> >>> 2. Rule engine upgrade, capable of matching TvrMetaSet and RelNodes,
>> as
>> >>> well as links in between the nodes.
>> >>> 3. A basic set of TvrRules, written using the upgraded rule engine
>> API.
>> >>> 4. Multi-query optimization, used to find the best incremental plan
>> >>> involving multiple time points.
>> >>>
>> >>> Note that this feature is an extension in nature and thus when
>> disabled,
>> >>> does not change any existing Calcite behavior.
>> >>>
>> >>> Other than scenarios in the paper, we also applied this
>> Calcite-extended
>> >>> incremental query optimizer to a type of periodic query called the
>> >> ‘‘range
>> >>> query’’ in Alibaba’s data warehouse. It achieved cost savings of 80%
>> on
>> >>> total CPU and memory consumption, and 60% on end-to-end execution
>> time.
>> >>>
>> >>> All comments and suggestions are welcome. Thanks and happy holidays!
>> >>>
>> >>> Best,
>> >>> Botong
>> >>>
>> >>
>>
>>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to