Re: Proposal to extend Calcite into a incremental query optimizer

Botong Huang Fri, 01 Jan 2021 11:05:26 -0800

Hi Julian,

Thanks for your interest! Sure let's figure out a plan that best benefits
the community. Here are some clarifications that hopefully answer your
questions.


In our work (Tempura), users specify the set of time points to consider
running and a cost function that expresses users' preference over time,
Tempura will generate the best incremental plan that minimizes the overall
cost function.

In this incremental plan, the sub-plans at different time points can be
different from each other, as opposed to identical plans in all delta runs
as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we can
mimic the current streaming implementation by specifying two (logical) time
points in Tempura, representing the initial run and later delta runs
respectively. In general, note that Tempura supports various form of
incremental computing, not only the small-delta append-only data model in
streaming systems. That's why we believe Tempura subsumes the current
streaming support, as well as any IVM implementations.

About the cost model, we did not come up with a seperate cost model, but
rather extended the existing one. Similar to multi-objective optimization,
costs incurred at different time points are considered different
dimensions. Tempura lets users supply a function that converts this cost
vector into a final cost. So under this function, any two incremental plans
are still comparable and there is an overall optimum. I guess we can go
down the route of multi-objective parametric query optimization instead if
there is a need.

Next on materialized views and multi-query optimization, since our
multi-time-point plan naturally involves materializing intermediate results
for later time points, we need to solve the problem of choosing
materializations and include the cost of saving and reusing the
materializations when costing and comparing plans. We borrowed the
multi-query optimization techniques to solve this problem even though we
are looking at a single query. As a result, we think our work is orthogonal
to Calcite's facilities around utilizing existing views, lattice etc. We do
feel that the multi-query optimization component can be adopted to wider
use, but probably need more suggestions from the community.

Lastly, our current implementation is set up in java code, it should be
straightforward to hook it up with SQL shell.

Thanks,
Botong

On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]> wrote:

> Botong,
>
> This is very exciting; congratulations on this research, and thank you for
> contributing it back to Calcite.
>
> The research touches several areas in Calcite: streaming, materialized
> view maintenance, and multi-query optimization. As we have already some
> solutions in those areas (Sigma and Delta relational operators, lattice,
> and Spool operator), it will be interesting to see whether we can make them
> compatible, or whether one concept can subsume others.
>
> Your work differs from streaming queries in that your relations are used
> by “external” user queries, whereas in pure streaming queries, the only
> activity is the change propagation. Did you find that you needed two
> separate cost models - one for “view maintenance” and another for “user
> queries” - since the objectives of each activity are so different?
>
> I wonder whether this work will hasten the arrival of multi-objective
> parametric query optimization [1] in Calcite.
>
> I will make time over the next few days to read and digest your paper.
> Then I expect that we will have a back-and-forth process to create
> something that will be useful for the broader community.
>
> One thing will be particularly useful: making this functionality available
> from a SQL shell, so that people can experiment with this functionality
> without writing Java code or setting up complex databases and metadata. I
> have in mind something like the simple DDL operations that are available in
> Calcite’s ’server’ module. I wonder whether we could devise some kind of
> SQL syntax for a “multi-query”.
>
> Julian
>
> [1]
> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>
>
>
> > On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> wrote:
> >
> > Thanks Aron for pointing this out. To see the figure, please refer to Fig
> > 3(a) in our paper:
> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
> >
> > Best,
> > Botong
> >
> > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]> wrote:
> >
> >> Seems interesting, the pic can not be seen in the mail, may you open a
> JIRA
> >> for this, people who are interested in this can subscribe to the JIRA?
> >>
> >>
> >> Regards!
> >>
> >> Aron Tao
> >>
> >>
> >> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
> >>
> >>> Hi all,
> >>>
> >>> This is a proposal to extend the Calcite optimizer into a general
> >>> incremental query optimizer, based on our research paper published in
> >> VLDB
> >>> 2021:
> >>> Tempura: a general cost-based optimizer framework for incremental data
> >>> processing
> >>>
> >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s data
> >>> warehouse is planning to use this incremental query optimizer to
> >> alleviate
> >>> cluster-wise resource skewness:
> >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental
> >> Computing
> >>>
> >>> To our best knowledge, this is the first general cost-based incremental
> >>> optimizer that can find the best plan across multiple families of
> >>> incremental computing methods, including IVM, Streaming, DBToaster,
> etc.
> >>> Experiments (in the paper) shows that the generated best plan is
> >>> consistently much better than the plans from each individual method
> >> alone.
> >>>
> >>> In general, incremental query planning is central to database view
> >>> maintenance and stream processing systems, and are being adopted in
> >> active
> >>> databases, resumable query execution, approximate query processing,
> etc.
> >> We
> >>> are hoping that this feature can help widening the spectrum of Calcite,
> >>> solicit more use cases and adoption of Calcite.
> >>>
> >>> Below is a brief description of the technical details. Please refer to
> >> the
> >>> Tempura paper for more details. We are also working on a journal
> version
> >> of
> >>> the paper with more implementation details.
> >>>
> >>> Currently the query plan generated by Calcite is meant to be executed
> >>> altogether at once. In the proposal, Calcite’s memo will be extended
> with
> >>> temporal information so that it is capable of generating incremental
> >> plans
> >>> that include multiple sub-plans to execute at different time points.
> >>>
> >>> The main idea is to view each table as one that changes over time (Time
> >>> Varying Relations (TVR)). To achieve that we introduced TvrMetaSet into
> >>> Calcite’s memo besides RelSet and RelSubset to track related RelSets
> of a
> >>> changing table (e.g. snapshot of the table at certain time, delta of
> the
> >>> table between two time points, etc.).
> >>>
> >>> [image: image.png]
> >>>
> >>> For example in the above figure, each vertical line is a TvrMetaSet
> >>> representing a TVR (S, R, S left outer join R, etc.). Horizontal lines
> >>> represent time. Each black dot in the grid is a RelSet. Users can write
> >> TVR
> >>> Rewrite Rules to describe valid transformations between these dots. For
> >>> example, the blues lines are inter-TVR rules that describe how to
> compute
> >>> certain RelSet of a TVR from RelSets of other TVRs. The red lines are
> >>> intra-TVR rules that describe transformations within a TVR. All TVR
> >> rewrite
> >>> rules are logical rules. All existing Calcite rules still work in the
> new
> >>> volcano system without modification.
> >>>
> >>> All changes in this feature will consist of four parts:
> >>> 1. Memo extension with TvrMetaSet
> >>> 2. Rule engine upgrade, capable of matching TvrMetaSet and RelNodes, as
> >>> well as links in between the nodes.
> >>> 3. A basic set of TvrRules, written using the upgraded rule engine API.
> >>> 4. Multi-query optimization, used to find the best incremental plan
> >>> involving multiple time points.
> >>>
> >>> Note that this feature is an extension in nature and thus when
> disabled,
> >>> does not change any existing Calcite behavior.
> >>>
> >>> Other than scenarios in the paper, we also applied this
> Calcite-extended
> >>> incremental query optimizer to a type of periodic query called the
> >> ‘‘range
> >>> query’’ in Alibaba’s data warehouse. It achieved cost savings of 80% on
> >>> total CPU and memory consumption, and 60% on end-to-end execution time.
> >>>
> >>> All comments and suggestions are welcome. Thanks and happy holidays!
> >>>
> >>> Best,
> >>> Botong
> >>>
> >>
>
>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to