Re: Proposal to extend Calcite into a incremental query optimizer

Botong Huang Wed, 07 Apr 2021 14:47:04 -0700

Hi Julian and all,

We've posted the Tempura code base below. Feel free to take a quick peek at
the last five commits.
https://github.com/alibaba/cost-based-incremental-optimizer/commits/main


I've also opened a Jira (CALCITE-4568
<https://issues.apache.org/jira/browse/CALCITE-4568>), which will serve as
the umbrella Jira for the feature.

In the meantime, we encourage everyone to enter the time preferences for
our first meeting here:
https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing

Thanks,
Botong

On Mon, Apr 5, 2021 at 3:59 PM Julian Hyde <[email protected]> wrote:

> I have added my time preferences to the doc.
>
> Before we meet, could you publish a PR for us to review?
>
> Initial discussions will need to be about architecture and high-level
> design. So I would ask Calcite reviewers not to review the PR line-by-line
> (or to leave comments in GitHub) but try to understand the design
> holistically, and prepare questions/comments before the meeting.
>
> Botong, Can you please create a Calcite JIRA case for this task? JIRA how
> we track long-running tasks such as this.
>
> Julian
>
>
> > On Apr 3, 2021, at 5:15 PM, Botong Huang <[email protected]> wrote:
> >
> > Hi all,
> >
> > Apology for the delay. It took us some time to clean up our code base and
> > publicly release it (which will be out soon) for a quick peek.
> >
> > We are ready to present our work. Let's schedule a time for a Zoom
> > meeting and discuss how to integrate Tempura into Calcite.
> >
> > Since some of our team members are in China, we prefer the time slot of
> > 7:00pm-11:30pm PST any day. I've added our time preference in the shared
> > doc below.
> >
> https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing
> >
> > We encourage everyone to add their time preferences (during 04/15-04/30)
> in
> > this doc. In a week or so, we will try to settle a time that works for
> > most.
> >
> > Thanks,
> > Botong
> >
> > On Sat, Jan 30, 2021 at 9:19 PM Botong Huang <[email protected]> wrote:
> >
> >> Hi Julian and Rui,
> >>
> >> Sounds good to us. Please give us some time to prepare some slides for
> the
> >> meeting.
> >>
> >> I've created a doc below for discussion. Please feel free to add more in
> >> here:
> >>
> >>
> https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing
> >>
> >> Thanks,
> >> Botong
> >>
> >> On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]>
> >> wrote:
> >>
> >>> PS The “editable doc” that Rui refers to is also a good idea. I think
> we
> >>> should create it to continue discussion after the first meeting.
> >>>
> >>> Julian
> >>>
> >>>> On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]>
> >>> wrote:
> >>>>
> >>>> I think good next steps would be a PR and a meeting. The PR will
> allow
> >>> us to read the code, but I think we should do the first round of
> questions
> >>> at the meeting.  The meeting could perhaps start with a presentation
> of the
> >>> paper (do you have some slides you are planning to present at VLDB,
> >>> Botong?) and then move on to questions about the concepts, which
> >>> alternatives were considered, and how the concepts map onto other
> current
> >>> and future concepts in calcite.
> >>>>
> >>>> I don’t think we should start “reviewing” the PR line-by-line at this
> >>> point. We need to understand the high-level concepts and design
> choices. If
> >>> we start reviewing the PR we will get lost in the details.
> >>>>
> >>>> I know that integrating a major change is hard; I doubt that we will
> be
> >>> able to integrate everything, but we can build understanding about
> where
> >>> calcite needs to go, and I hope integrate a good amount of code to
> help us
> >>> get there.
> >>>>
> >>>> As I said before, after the integration I would like people to be able
> >>> to experiment with it and use it in their production systems.  That
> way, it
> >>> will not be an experiment that withers, but a feature set integrates
> with
> >>> other calcite features and gets stronger over time.
> >>>>
> >>>> Julian
> >>>>
> >>>>> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote:
> >>>>>
> >>>>> For me to participate in the discussion for the above questions, I
> >>> will
> >>>>> need to read a lot more to know relevant context and likely ask lots
> of
> >>>>> questions :-).  A editable doc is probably good for questions and
> back
> >>> and
> >>>>> forward discussion.
> >>>>>
> >>>>>
> >>>>> -Rui
> >>>>>
> >>>>>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]>
> >>> wrote:
> >>>>>>
> >>>>>> I am also happy to help push this work into Calcite (review code and
> >>> doc,
> >>>>>> etc.).
> >>>>>>
> >>>>>> While you can share your code so people can have more idea how it is
> >>>>>> implemented, I think it would be also nice to have a doc to discuss
> >>> open
> >>>>>> questions above. Some points that I copy those to here:
> >>>>>>
> >>>>>> 1. Can this solution be compatible with existing solutions in
> Calcite
> >>>>>> Streaming, materialized view maintenance, and multi-query
> optimization
> >>>>>> (Sigma and Delta relational operators, lattice, and Spool operator),
> >>>>>> 2. Did you find that you needed two separate cost models - one for
> >>> “view
> >>>>>> maintenance” and another for “user queries” - since the objectives
> of
> >>> each
> >>>>>> activity are so different?
> >>>>>> 3. whether this work will hasten the arrival of multi-objective
> >>> parametric
> >>>>>> query optimization [1] in Calcite.
> >>>>>> 4. probably SQL shell support.
> >>>>>>
> >>>>>>
> >>>>>> [1]:
> >>>>>>
> >>>
> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
> >>>>>>
> >>>>>>
> >>>>>> -Rui
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote:
> >>>>>>>
> >>>>>>> it would be very nice to see a POC of your work.
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]>
> >>> wrote:
> >>>>>>>
> >>>>>>>> Hi Julian,
> >>>>>>>>
> >>>>>>>> Just wondering if there are any updates? We are wondering if it
> >>> would
> >>>>>>> help
> >>>>>>>> to post our code for a quick preview.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Botong
> >>>>>>>>
> >>>>>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]>
> >>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Julian,
> >>>>>>>>>
> >>>>>>>>> Thanks for your interest! Sure let's figure out a plan that best
> >>>>>>> benefits
> >>>>>>>>> the community. Here are some clarifications that hopefully answer
> >>> your
> >>>>>>>>> questions.
> >>>>>>>>>
> >>>>>>>>> In our work (Tempura), users specify the set of time points to
> >>>>>>> consider
> >>>>>>>>> running and a cost function that expresses users' preference over
> >>>>>>> time,
> >>>>>>>>> Tempura will generate the best incremental plan that minimizes
> the
> >>>>>>>> overall
> >>>>>>>>> cost function.
> >>>>>>>>>
> >>>>>>>>> In this incremental plan, the sub-plans at different time points
> >>> can
> >>>>>>> be
> >>>>>>>>> different from each other, as opposed to identical plans in all
> >>> delta
> >>>>>>>> runs
> >>>>>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura
> paper,
> >>> we
> >>>>>>> can
> >>>>>>>>> mimic the current streaming implementation by specifying two
> >>> (logical)
> >>>>>>>> time
> >>>>>>>>> points in Tempura, representing the initial run and later delta
> >>> runs
> >>>>>>>>> respectively. In general, note that Tempura supports various form
> >>> of
> >>>>>>>>> incremental computing, not only the small-delta append-only data
> >>>>>>> model in
> >>>>>>>>> streaming systems. That's why we believe Tempura subsumes the
> >>> current
> >>>>>>>>> streaming support, as well as any IVM implementations.
> >>>>>>>>>
> >>>>>>>>> About the cost model, we did not come up with a seperate cost
> >>> model,
> >>>>>>> but
> >>>>>>>>> rather extended the existing one. Similar to multi-objective
> >>>>>>>> optimization,
> >>>>>>>>> costs incurred at different time points are considered different
> >>>>>>>>> dimensions. Tempura lets users supply a function that converts
> this
> >>>>>>> cost
> >>>>>>>>> vector into a final cost. So under this function, any two
> >>> incremental
> >>>>>>>> plans
> >>>>>>>>> are still comparable and there is an overall optimum. I guess we
> >>> can
> >>>>>>> go
> >>>>>>>>> down the route of multi-objective parametric query optimization
> >>>>>>> instead
> >>>>>>>> if
> >>>>>>>>> there is a need.
> >>>>>>>>>
> >>>>>>>>> Next on materialized views and multi-query optimization, since
> our
> >>>>>>>>> multi-time-point plan naturally involves materializing
> intermediate
> >>>>>>>> results
> >>>>>>>>> for later time points, we need to solve the problem of choosing
> >>>>>>>>> materializations and include the cost of saving and reusing the
> >>>>>>>>> materializations when costing and comparing plans. We borrowed
> the
> >>>>>>>>> multi-query optimization techniques to solve this problem even
> >>> though
> >>>>>>> we
> >>>>>>>>> are looking at a single query. As a result, we think our work is
> >>>>>>>> orthogonal
> >>>>>>>>> to Calcite's facilities around utilizing existing views, lattice
> >>> etc.
> >>>>>>> We
> >>>>>>>> do
> >>>>>>>>> feel that the multi-query optimization component can be adopted
> to
> >>>>>>> wider
> >>>>>>>>> use, but probably need more suggestions from the community.
> >>>>>>>>>
> >>>>>>>>> Lastly, our current implementation is set up in java code, it
> >>> should
> >>>>>>> be
> >>>>>>>>> straightforward to hook it up with SQL shell.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Botong
> >>>>>>>>>
> >>>>>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <
> >>> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Botong,
> >>>>>>>>>>
> >>>>>>>>>> This is very exciting; congratulations on this research, and
> thank
> >>>>>>> you
> >>>>>>>>>> for contributing it back to Calcite.
> >>>>>>>>>>
> >>>>>>>>>> The research touches several areas in Calcite: streaming,
> >>>>>>> materialized
> >>>>>>>>>> view maintenance, and multi-query optimization. As we have
> already
> >>>>>>> some
> >>>>>>>>>> solutions in those areas (Sigma and Delta relational operators,
> >>>>>>> lattice,
> >>>>>>>>>> and Spool operator), it will be interesting to see whether we
> can
> >>>>>>> make
> >>>>>>>> them
> >>>>>>>>>> compatible, or whether one concept can subsume others.
> >>>>>>>>>>
> >>>>>>>>>> Your work differs from streaming queries in that your relations
> >>> are
> >>>>>>> used
> >>>>>>>>>> by “external” user queries, whereas in pure streaming queries,
> the
> >>>>>>> only
> >>>>>>>>>> activity is the change propagation. Did you find that you needed
> >>> two
> >>>>>>>>>> separate cost models - one for “view maintenance” and another
> for
> >>>>>>> “user
> >>>>>>>>>> queries” - since the objectives of each activity are so
> different?
> >>>>>>>>>>
> >>>>>>>>>> I wonder whether this work will hasten the arrival of
> >>> multi-objective
> >>>>>>>>>> parametric query optimization [1] in Calcite.
> >>>>>>>>>>
> >>>>>>>>>> I will make time over the next few days to read and digest your
> >>>>>>> paper.
> >>>>>>>>>> Then I expect that we will have a back-and-forth process to
> create
> >>>>>>>>>> something that will be useful for the broader community.
> >>>>>>>>>>
> >>>>>>>>>> One thing will be particularly useful: making this functionality
> >>>>>>>>>> available from a SQL shell, so that people can experiment with
> >>> this
> >>>>>>>>>> functionality without writing Java code or setting up complex
> >>>>>>> databases
> >>>>>>>> and
> >>>>>>>>>> metadata. I have in mind something like the simple DDL
> operations
> >>>>>>> that
> >>>>>>>> are
> >>>>>>>>>> available in Calcite’s ’server’ module. I wonder whether we
> could
> >>>>>>> devise
> >>>>>>>>>> some kind of SQL syntax for a “multi-query”.
> >>>>>>>>>>
> >>>>>>>>>> Julian
> >>>>>>>>>>
> >>>>>>>>>> [1]
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>
> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]>
> >>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks Aron for pointing this out. To see the figure, please
> >>> refer
> >>>>>>> to
> >>>>>>>>>> Fig
> >>>>>>>>>>> 3(a) in our paper:
> >>>>>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Botong
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <
> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Seems interesting, the pic can not be seen in the mail, may
> you
> >>>>>>> open
> >>>>>>>> a
> >>>>>>>>>> JIRA
> >>>>>>>>>>>> for this, people who are interested in this can subscribe to
> the
> >>>>>>>> JIRA?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Aron Tao
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This is a proposal to extend the Calcite optimizer into a
> >>> general
> >>>>>>>>>>>>> incremental query optimizer, based on our research paper
> >>>>>>> published
> >>>>>>>> in
> >>>>>>>>>>>> VLDB
> >>>>>>>>>>>>> 2021:
> >>>>>>>>>>>>> Tempura: a general cost-based optimizer framework for
> >>> incremental
> >>>>>>>> data
> >>>>>>>>>>>>> processing
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s
> >>>>>>> data
> >>>>>>>>>>>>> warehouse is planning to use this incremental query optimizer
> >>> to
> >>>>>>>>>>>> alleviate
> >>>>>>>>>>>>> cluster-wise resource skewness:
> >>>>>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware
> >>> Incremental
> >>>>>>>>>>>> Computing
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To our best knowledge, this is the first general cost-based
> >>>>>>>>>> incremental
> >>>>>>>>>>>>> optimizer that can find the best plan across multiple
> families
> >>> of
> >>>>>>>>>>>>> incremental computing methods, including IVM, Streaming,
> >>>>>>> DBToaster,
> >>>>>>>>>> etc.
> >>>>>>>>>>>>> Experiments (in the paper) shows that the generated best plan
> >>> is
> >>>>>>>>>>>>> consistently much better than the plans from each individual
> >>>>>>> method
> >>>>>>>>>>>> alone.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In general, incremental query planning is central to database
> >>>>>>> view
> >>>>>>>>>>>>> maintenance and stream processing systems, and are being
> >>> adopted
> >>>>>>> in
> >>>>>>>>>>>> active
> >>>>>>>>>>>>> databases, resumable query execution, approximate query
> >>>>>>> processing,
> >>>>>>>>>> etc.
> >>>>>>>>>>>> We
> >>>>>>>>>>>>> are hoping that this feature can help widening the spectrum
> of
> >>>>>>>>>> Calcite,
> >>>>>>>>>>>>> solicit more use cases and adoption of Calcite.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Below is a brief description of the technical details. Please
> >>>>>>> refer
> >>>>>>>> to
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> Tempura paper for more details. We are also working on a
> >>> journal
> >>>>>>>>>> version
> >>>>>>>>>>>> of
> >>>>>>>>>>>>> the paper with more implementation details.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Currently the query plan generated by Calcite is meant to be
> >>>>>>>> executed
> >>>>>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be
> >>>>>>> extended
> >>>>>>>>>> with
> >>>>>>>>>>>>> temporal information so that it is capable of generating
> >>>>>>> incremental
> >>>>>>>>>>>> plans
> >>>>>>>>>>>>> that include multiple sub-plans to execute at different time
> >>>>>>> points.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The main idea is to view each table as one that changes over
> >>> time
> >>>>>>>>>> (Time
> >>>>>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced
> >>>>>>> TvrMetaSet
> >>>>>>>>>> into
> >>>>>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related
> >>>>>>> RelSets
> >>>>>>>>>> of a
> >>>>>>>>>>>>> changing table (e.g. snapshot of the table at certain time,
> >>>>>>> delta of
> >>>>>>>>>> the
> >>>>>>>>>>>>> table between two time points, etc.).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [image: image.png]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For example in the above figure, each vertical line is a
> >>>>>>> TvrMetaSet
> >>>>>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.).
> >>> Horizontal
> >>>>>>>> lines
> >>>>>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users
> >>> can
> >>>>>>>>>> write
> >>>>>>>>>>>> TVR
> >>>>>>>>>>>>> Rewrite Rules to describe valid transformations between these
> >>>>>>> dots.
> >>>>>>>>>> For
> >>>>>>>>>>>>> example, the blues lines are inter-TVR rules that describe
> how
> >>> to
> >>>>>>>>>> compute
> >>>>>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red
> >>> lines
> >>>>>>>> are
> >>>>>>>>>>>>> intra-TVR rules that describe transformations within a TVR.
> All
> >>>>>>> TVR
> >>>>>>>>>>>> rewrite
> >>>>>>>>>>>>> rules are logical rules. All existing Calcite rules still
> work
> >>> in
> >>>>>>>> the
> >>>>>>>>>> new
> >>>>>>>>>>>>> volcano system without modification.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> All changes in this feature will consist of four parts:
> >>>>>>>>>>>>> 1. Memo extension with TvrMetaSet
> >>>>>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and
> >>>>>>> RelNodes,
> >>>>>>>>>> as
> >>>>>>>>>>>>> well as links in between the nodes.
> >>>>>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule
> >>>>>>> engine
> >>>>>>>>>> API.
> >>>>>>>>>>>>> 4. Multi-query optimization, used to find the best
> incremental
> >>>>>>> plan
> >>>>>>>>>>>>> involving multiple time points.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Note that this feature is an extension in nature and thus
> when
> >>>>>>>>>> disabled,
> >>>>>>>>>>>>> does not change any existing Calcite behavior.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Other than scenarios in the paper, we also applied this
> >>>>>>>>>> Calcite-extended
> >>>>>>>>>>>>> incremental query optimizer to a type of periodic query
> called
> >>>>>>> the
> >>>>>>>>>>>> ‘‘range
> >>>>>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings
> >>> of
> >>>>>>> 80%
> >>>>>>>>>> on
> >>>>>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end
> >>> execution
> >>>>>>>>>> time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> All comments and suggestions are welcome. Thanks and happy
> >>>>>>> holidays!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Botong
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> ~~~~~~~~~~~~~~~
> >>>>>>> no mistakes
> >>>>>>> ~~~~~~~~~~~~~~~~~~
> >>>>>>>
> >>>>>>
> >>>
> >>
>
>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to