Re: Proposal to extend Calcite into a incremental query optimizer

Rui Wang Thu, 28 Jan 2021 10:50:45 -0800

I am also happy to help push this work into Calcite (review code and doc,
etc.).


While you can share your code so people can have more idea how it is
implemented, I think it would be also nice to have a doc to discuss open
questions above. Some points that I copy those to here:

1. Can this solution be compatible with existing solutions in Calcite
Streaming, materialized view maintenance, and multi-query optimization
(Sigma and Delta relational operators, lattice, and Spool operator),
2. Did you find that you needed two separate cost models - one for “view
maintenance” and another for “user queries” - since the objectives of each
activity are so different?
3. whether this work will hasten the arrival of multi-objective parametric
query optimization [1] in Calcite.
4. probably SQL shell support.


[1]:
https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext


-Rui



On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote:

> it would be very nice to see a POC of your work.
>
>
> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> wrote:
>
> > Hi Julian,
> >
> > Just wondering if there are any updates? We are wondering if it would
> help
> > to post our code for a quick preview.
> >
> > Thanks,
> > Botong
> >
> > On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> wrote:
> >
> > > Hi Julian,
> > >
> > > Thanks for your interest! Sure let's figure out a plan that best
> benefits
> > > the community. Here are some clarifications that hopefully answer your
> > > questions.
> > >
> > > In our work (Tempura), users specify the set of time points to consider
> > > running and a cost function that expresses users' preference over time,
> > > Tempura will generate the best incremental plan that minimizes the
> > overall
> > > cost function.
> > >
> > > In this incremental plan, the sub-plans at different time points can be
> > > different from each other, as opposed to identical plans in all delta
> > runs
> > > as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we
> can
> > > mimic the current streaming implementation by specifying two (logical)
> > time
> > > points in Tempura, representing the initial run and later delta runs
> > > respectively. In general, note that Tempura supports various form of
> > > incremental computing, not only the small-delta append-only data model
> in
> > > streaming systems. That's why we believe Tempura subsumes the current
> > > streaming support, as well as any IVM implementations.
> > >
> > > About the cost model, we did not come up with a seperate cost model,
> but
> > > rather extended the existing one. Similar to multi-objective
> > optimization,
> > > costs incurred at different time points are considered different
> > > dimensions. Tempura lets users supply a function that converts this
> cost
> > > vector into a final cost. So under this function, any two incremental
> > plans
> > > are still comparable and there is an overall optimum. I guess we can go
> > > down the route of multi-objective parametric query optimization instead
> > if
> > > there is a need.
> > >
> > > Next on materialized views and multi-query optimization, since our
> > > multi-time-point plan naturally involves materializing intermediate
> > results
> > > for later time points, we need to solve the problem of choosing
> > > materializations and include the cost of saving and reusing the
> > > materializations when costing and comparing plans. We borrowed the
> > > multi-query optimization techniques to solve this problem even though
> we
> > > are looking at a single query. As a result, we think our work is
> > orthogonal
> > > to Calcite's facilities around utilizing existing views, lattice etc.
> We
> > do
> > > feel that the multi-query optimization component can be adopted to
> wider
> > > use, but probably need more suggestions from the community.
> > >
> > > Lastly, our current implementation is set up in java code, it should be
> > > straightforward to hook it up with SQL shell.
> > >
> > > Thanks,
> > > Botong
> > >
> > > On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]>
> > > wrote:
> > >
> > >> Botong,
> > >>
> > >> This is very exciting; congratulations on this research, and thank you
> > >> for contributing it back to Calcite.
> > >>
> > >> The research touches several areas in Calcite: streaming, materialized
> > >> view maintenance, and multi-query optimization. As we have already
> some
> > >> solutions in those areas (Sigma and Delta relational operators,
> lattice,
> > >> and Spool operator), it will be interesting to see whether we can make
> > them
> > >> compatible, or whether one concept can subsume others.
> > >>
> > >> Your work differs from streaming queries in that your relations are
> used
> > >> by “external” user queries, whereas in pure streaming queries, the
> only
> > >> activity is the change propagation. Did you find that you needed two
> > >> separate cost models - one for “view maintenance” and another for
> “user
> > >> queries” - since the objectives of each activity are so different?
> > >>
> > >> I wonder whether this work will hasten the arrival of multi-objective
> > >> parametric query optimization [1] in Calcite.
> > >>
> > >> I will make time over the next few days to read and digest your paper.
> > >> Then I expect that we will have a back-and-forth process to create
> > >> something that will be useful for the broader community.
> > >>
> > >> One thing will be particularly useful: making this functionality
> > >> available from a SQL shell, so that people can experiment with this
> > >> functionality without writing Java code or setting up complex
> databases
> > and
> > >> metadata. I have in mind something like the simple DDL operations that
> > are
> > >> available in Calcite’s ’server’ module. I wonder whether we could
> devise
> > >> some kind of SQL syntax for a “multi-query”.
> > >>
> > >> Julian
> > >>
> > >> [1]
> > >>
> >
> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
> > >>
> > >>
> > >>
> > >> > On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> wrote:
> > >> >
> > >> > Thanks Aron for pointing this out. To see the figure, please refer
> to
> > >> Fig
> > >> > 3(a) in our paper:
> > >> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
> > >> >
> > >> > Best,
> > >> > Botong
> > >> >
> > >> > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]>
> > wrote:
> > >> >
> > >> >> Seems interesting, the pic can not be seen in the mail, may you
> open
> > a
> > >> JIRA
> > >> >> for this, people who are interested in this can subscribe to the
> > JIRA?
> > >> >>
> > >> >>
> > >> >> Regards!
> > >> >>
> > >> >> Aron Tao
> > >> >>
> > >> >>
> > >> >> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
> > >> >>
> > >> >>> Hi all,
> > >> >>>
> > >> >>> This is a proposal to extend the Calcite optimizer into a general
> > >> >>> incremental query optimizer, based on our research paper published
> > in
> > >> >> VLDB
> > >> >>> 2021:
> > >> >>> Tempura: a general cost-based optimizer framework for incremental
> > data
> > >> >>> processing
> > >> >>>
> > >> >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s data
> > >> >>> warehouse is planning to use this incremental query optimizer to
> > >> >> alleviate
> > >> >>> cluster-wise resource skewness:
> > >> >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental
> > >> >> Computing
> > >> >>>
> > >> >>> To our best knowledge, this is the first general cost-based
> > >> incremental
> > >> >>> optimizer that can find the best plan across multiple families of
> > >> >>> incremental computing methods, including IVM, Streaming,
> DBToaster,
> > >> etc.
> > >> >>> Experiments (in the paper) shows that the generated best plan is
> > >> >>> consistently much better than the plans from each individual
> method
> > >> >> alone.
> > >> >>>
> > >> >>> In general, incremental query planning is central to database view
> > >> >>> maintenance and stream processing systems, and are being adopted
> in
> > >> >> active
> > >> >>> databases, resumable query execution, approximate query
> processing,
> > >> etc.
> > >> >> We
> > >> >>> are hoping that this feature can help widening the spectrum of
> > >> Calcite,
> > >> >>> solicit more use cases and adoption of Calcite.
> > >> >>>
> > >> >>> Below is a brief description of the technical details. Please
> refer
> > to
> > >> >> the
> > >> >>> Tempura paper for more details. We are also working on a journal
> > >> version
> > >> >> of
> > >> >>> the paper with more implementation details.
> > >> >>>
> > >> >>> Currently the query plan generated by Calcite is meant to be
> > executed
> > >> >>> altogether at once. In the proposal, Calcite’s memo will be
> extended
> > >> with
> > >> >>> temporal information so that it is capable of generating
> incremental
> > >> >> plans
> > >> >>> that include multiple sub-plans to execute at different time
> points.
> > >> >>>
> > >> >>> The main idea is to view each table as one that changes over time
> > >> (Time
> > >> >>> Varying Relations (TVR)). To achieve that we introduced TvrMetaSet
> > >> into
> > >> >>> Calcite’s memo besides RelSet and RelSubset to track related
> RelSets
> > >> of a
> > >> >>> changing table (e.g. snapshot of the table at certain time, delta
> of
> > >> the
> > >> >>> table between two time points, etc.).
> > >> >>>
> > >> >>> [image: image.png]
> > >> >>>
> > >> >>> For example in the above figure, each vertical line is a
> TvrMetaSet
> > >> >>> representing a TVR (S, R, S left outer join R, etc.). Horizontal
> > lines
> > >> >>> represent time. Each black dot in the grid is a RelSet. Users can
> > >> write
> > >> >> TVR
> > >> >>> Rewrite Rules to describe valid transformations between these
> dots.
> > >> For
> > >> >>> example, the blues lines are inter-TVR rules that describe how to
> > >> compute
> > >> >>> certain RelSet of a TVR from RelSets of other TVRs. The red lines
> > are
> > >> >>> intra-TVR rules that describe transformations within a TVR. All
> TVR
> > >> >> rewrite
> > >> >>> rules are logical rules. All existing Calcite rules still work in
> > the
> > >> new
> > >> >>> volcano system without modification.
> > >> >>>
> > >> >>> All changes in this feature will consist of four parts:
> > >> >>> 1. Memo extension with TvrMetaSet
> > >> >>> 2. Rule engine upgrade, capable of matching TvrMetaSet and
> RelNodes,
> > >> as
> > >> >>> well as links in between the nodes.
> > >> >>> 3. A basic set of TvrRules, written using the upgraded rule engine
> > >> API.
> > >> >>> 4. Multi-query optimization, used to find the best incremental
> plan
> > >> >>> involving multiple time points.
> > >> >>>
> > >> >>> Note that this feature is an extension in nature and thus when
> > >> disabled,
> > >> >>> does not change any existing Calcite behavior.
> > >> >>>
> > >> >>> Other than scenarios in the paper, we also applied this
> > >> Calcite-extended
> > >> >>> incremental query optimizer to a type of periodic query called the
> > >> >> ‘‘range
> > >> >>> query’’ in Alibaba’s data warehouse. It achieved cost savings of
> 80%
> > >> on
> > >> >>> total CPU and memory consumption, and 60% on end-to-end execution
> > >> time.
> > >> >>>
> > >> >>> All comments and suggestions are welcome. Thanks and happy
> holidays!
> > >> >>>
> > >> >>> Best,
> > >> >>> Botong
> > >> >>>
> > >> >>
> > >>
> > >>
> >
>
>
> --
> ~~~~~~~~~~~~~~~
> no mistakes
> ~~~~~~~~~~~~~~~~~~
>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to