Re: Proposal to extend Calcite into a incremental query optimizer

Rui Wang Thu, 28 Jan 2021 10:54:30 -0800

For me to participate in the discussion for the above questions, I will
need to read a lot more to know relevant context and likely ask lots of
questions :-).  A editable doc is probably good for questions and back and
forward discussion.



-Rui

On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> wrote:

> I am also happy to help push this work into Calcite (review code and doc,
> etc.).
>
> While you can share your code so people can have more idea how it is
> implemented, I think it would be also nice to have a doc to discuss open
> questions above. Some points that I copy those to here:
>
> 1. Can this solution be compatible with existing solutions in Calcite
> Streaming, materialized view maintenance, and multi-query optimization
> (Sigma and Delta relational operators, lattice, and Spool operator),
> 2. Did you find that you needed two separate cost models - one for “view
> maintenance” and another for “user queries” - since the objectives of each
> activity are so different?
> 3. whether this work will hasten the arrival of multi-objective parametric
> query optimization [1] in Calcite.
> 4. probably SQL shell support.
>
>
> [1]:
> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>
>
> -Rui
>
>
>
> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote:
>
>> it would be very nice to see a POC of your work.
>>
>>
>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> wrote:
>>
>> > Hi Julian,
>> >
>> > Just wondering if there are any updates? We are wondering if it would
>> help
>> > to post our code for a quick preview.
>> >
>> > Thanks,
>> > Botong
>> >
>> > On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> wrote:
>> >
>> > > Hi Julian,
>> > >
>> > > Thanks for your interest! Sure let's figure out a plan that best
>> benefits
>> > > the community. Here are some clarifications that hopefully answer your
>> > > questions.
>> > >
>> > > In our work (Tempura), users specify the set of time points to
>> consider
>> > > running and a cost function that expresses users' preference over
>> time,
>> > > Tempura will generate the best incremental plan that minimizes the
>> > overall
>> > > cost function.
>> > >
>> > > In this incremental plan, the sub-plans at different time points can
>> be
>> > > different from each other, as opposed to identical plans in all delta
>> > runs
>> > > as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we
>> can
>> > > mimic the current streaming implementation by specifying two (logical)
>> > time
>> > > points in Tempura, representing the initial run and later delta runs
>> > > respectively. In general, note that Tempura supports various form of
>> > > incremental computing, not only the small-delta append-only data
>> model in
>> > > streaming systems. That's why we believe Tempura subsumes the current
>> > > streaming support, as well as any IVM implementations.
>> > >
>> > > About the cost model, we did not come up with a seperate cost model,
>> but
>> > > rather extended the existing one. Similar to multi-objective
>> > optimization,
>> > > costs incurred at different time points are considered different
>> > > dimensions. Tempura lets users supply a function that converts this
>> cost
>> > > vector into a final cost. So under this function, any two incremental
>> > plans
>> > > are still comparable and there is an overall optimum. I guess we can
>> go
>> > > down the route of multi-objective parametric query optimization
>> instead
>> > if
>> > > there is a need.
>> > >
>> > > Next on materialized views and multi-query optimization, since our
>> > > multi-time-point plan naturally involves materializing intermediate
>> > results
>> > > for later time points, we need to solve the problem of choosing
>> > > materializations and include the cost of saving and reusing the
>> > > materializations when costing and comparing plans. We borrowed the
>> > > multi-query optimization techniques to solve this problem even though
>> we
>> > > are looking at a single query. As a result, we think our work is
>> > orthogonal
>> > > to Calcite's facilities around utilizing existing views, lattice etc.
>> We
>> > do
>> > > feel that the multi-query optimization component can be adopted to
>> wider
>> > > use, but probably need more suggestions from the community.
>> > >
>> > > Lastly, our current implementation is set up in java code, it should
>> be
>> > > straightforward to hook it up with SQL shell.
>> > >
>> > > Thanks,
>> > > Botong
>> > >
>> > > On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]>
>> > > wrote:
>> > >
>> > >> Botong,
>> > >>
>> > >> This is very exciting; congratulations on this research, and thank
>> you
>> > >> for contributing it back to Calcite.
>> > >>
>> > >> The research touches several areas in Calcite: streaming,
>> materialized
>> > >> view maintenance, and multi-query optimization. As we have already
>> some
>> > >> solutions in those areas (Sigma and Delta relational operators,
>> lattice,
>> > >> and Spool operator), it will be interesting to see whether we can
>> make
>> > them
>> > >> compatible, or whether one concept can subsume others.
>> > >>
>> > >> Your work differs from streaming queries in that your relations are
>> used
>> > >> by “external” user queries, whereas in pure streaming queries, the
>> only
>> > >> activity is the change propagation. Did you find that you needed two
>> > >> separate cost models - one for “view maintenance” and another for
>> “user
>> > >> queries” - since the objectives of each activity are so different?
>> > >>
>> > >> I wonder whether this work will hasten the arrival of multi-objective
>> > >> parametric query optimization [1] in Calcite.
>> > >>
>> > >> I will make time over the next few days to read and digest your
>> paper.
>> > >> Then I expect that we will have a back-and-forth process to create
>> > >> something that will be useful for the broader community.
>> > >>
>> > >> One thing will be particularly useful: making this functionality
>> > >> available from a SQL shell, so that people can experiment with this
>> > >> functionality without writing Java code or setting up complex
>> databases
>> > and
>> > >> metadata. I have in mind something like the simple DDL operations
>> that
>> > are
>> > >> available in Calcite’s ’server’ module. I wonder whether we could
>> devise
>> > >> some kind of SQL syntax for a “multi-query”.
>> > >>
>> > >> Julian
>> > >>
>> > >> [1]
>> > >>
>> >
>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>> > >>
>> > >>
>> > >>
>> > >> > On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]>
>> wrote:
>> > >> >
>> > >> > Thanks Aron for pointing this out. To see the figure, please refer
>> to
>> > >> Fig
>> > >> > 3(a) in our paper:
>> > >> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
>> > >> >
>> > >> > Best,
>> > >> > Botong
>> > >> >
>> > >> > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]>
>> > wrote:
>> > >> >
>> > >> >> Seems interesting, the pic can not be seen in the mail, may you
>> open
>> > a
>> > >> JIRA
>> > >> >> for this, people who are interested in this can subscribe to the
>> > JIRA?
>> > >> >>
>> > >> >>
>> > >> >> Regards!
>> > >> >>
>> > >> >> Aron Tao
>> > >> >>
>> > >> >>
>> > >> >> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
>> > >> >>
>> > >> >>> Hi all,
>> > >> >>>
>> > >> >>> This is a proposal to extend the Calcite optimizer into a general
>> > >> >>> incremental query optimizer, based on our research paper
>> published
>> > in
>> > >> >> VLDB
>> > >> >>> 2021:
>> > >> >>> Tempura: a general cost-based optimizer framework for incremental
>> > data
>> > >> >>> processing
>> > >> >>>
>> > >> >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s
>> data
>> > >> >>> warehouse is planning to use this incremental query optimizer to
>> > >> >> alleviate
>> > >> >>> cluster-wise resource skewness:
>> > >> >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental
>> > >> >> Computing
>> > >> >>>
>> > >> >>> To our best knowledge, this is the first general cost-based
>> > >> incremental
>> > >> >>> optimizer that can find the best plan across multiple families of
>> > >> >>> incremental computing methods, including IVM, Streaming,
>> DBToaster,
>> > >> etc.
>> > >> >>> Experiments (in the paper) shows that the generated best plan is
>> > >> >>> consistently much better than the plans from each individual
>> method
>> > >> >> alone.
>> > >> >>>
>> > >> >>> In general, incremental query planning is central to database
>> view
>> > >> >>> maintenance and stream processing systems, and are being adopted
>> in
>> > >> >> active
>> > >> >>> databases, resumable query execution, approximate query
>> processing,
>> > >> etc.
>> > >> >> We
>> > >> >>> are hoping that this feature can help widening the spectrum of
>> > >> Calcite,
>> > >> >>> solicit more use cases and adoption of Calcite.
>> > >> >>>
>> > >> >>> Below is a brief description of the technical details. Please
>> refer
>> > to
>> > >> >> the
>> > >> >>> Tempura paper for more details. We are also working on a journal
>> > >> version
>> > >> >> of
>> > >> >>> the paper with more implementation details.
>> > >> >>>
>> > >> >>> Currently the query plan generated by Calcite is meant to be
>> > executed
>> > >> >>> altogether at once. In the proposal, Calcite’s memo will be
>> extended
>> > >> with
>> > >> >>> temporal information so that it is capable of generating
>> incremental
>> > >> >> plans
>> > >> >>> that include multiple sub-plans to execute at different time
>> points.
>> > >> >>>
>> > >> >>> The main idea is to view each table as one that changes over time
>> > >> (Time
>> > >> >>> Varying Relations (TVR)). To achieve that we introduced
>> TvrMetaSet
>> > >> into
>> > >> >>> Calcite’s memo besides RelSet and RelSubset to track related
>> RelSets
>> > >> of a
>> > >> >>> changing table (e.g. snapshot of the table at certain time,
>> delta of
>> > >> the
>> > >> >>> table between two time points, etc.).
>> > >> >>>
>> > >> >>> [image: image.png]
>> > >> >>>
>> > >> >>> For example in the above figure, each vertical line is a
>> TvrMetaSet
>> > >> >>> representing a TVR (S, R, S left outer join R, etc.). Horizontal
>> > lines
>> > >> >>> represent time. Each black dot in the grid is a RelSet. Users can
>> > >> write
>> > >> >> TVR
>> > >> >>> Rewrite Rules to describe valid transformations between these
>> dots.
>> > >> For
>> > >> >>> example, the blues lines are inter-TVR rules that describe how to
>> > >> compute
>> > >> >>> certain RelSet of a TVR from RelSets of other TVRs. The red lines
>> > are
>> > >> >>> intra-TVR rules that describe transformations within a TVR. All
>> TVR
>> > >> >> rewrite
>> > >> >>> rules are logical rules. All existing Calcite rules still work in
>> > the
>> > >> new
>> > >> >>> volcano system without modification.
>> > >> >>>
>> > >> >>> All changes in this feature will consist of four parts:
>> > >> >>> 1. Memo extension with TvrMetaSet
>> > >> >>> 2. Rule engine upgrade, capable of matching TvrMetaSet and
>> RelNodes,
>> > >> as
>> > >> >>> well as links in between the nodes.
>> > >> >>> 3. A basic set of TvrRules, written using the upgraded rule
>> engine
>> > >> API.
>> > >> >>> 4. Multi-query optimization, used to find the best incremental
>> plan
>> > >> >>> involving multiple time points.
>> > >> >>>
>> > >> >>> Note that this feature is an extension in nature and thus when
>> > >> disabled,
>> > >> >>> does not change any existing Calcite behavior.
>> > >> >>>
>> > >> >>> Other than scenarios in the paper, we also applied this
>> > >> Calcite-extended
>> > >> >>> incremental query optimizer to a type of periodic query called
>> the
>> > >> >> ‘‘range
>> > >> >>> query’’ in Alibaba’s data warehouse. It achieved cost savings of
>> 80%
>> > >> on
>> > >> >>> total CPU and memory consumption, and 60% on end-to-end execution
>> > >> time.
>> > >> >>>
>> > >> >>> All comments and suggestions are welcome. Thanks and happy
>> holidays!
>> > >> >>>
>> > >> >>> Best,
>> > >> >>> Botong
>> > >> >>>
>> > >> >>
>> > >>
>> > >>
>> >
>>
>>
>> --
>> ~~~~~~~~~~~~~~~
>> no mistakes
>> ~~~~~~~~~~~~~~~~~~
>>
>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to