Re: Proposal to extend Calcite into a incremental query optimizer

Botong Huang Sat, 30 Jan 2021 21:22:03 -0800

Hi Julian and Rui,

Sounds good to us. Please give us some time to prepare some slides for the
meeting.


I've created a doc below for discussion. Please feel free to add more in
here:
https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing

Thanks,
Botong

On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]> wrote:

> PS The “editable doc” that Rui refers to is also a good idea. I think we
> should create it to continue discussion after the first meeting.
>
> Julian
>
> > On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]>
> wrote:
> >
> > I think good next steps would be a PR and a meeting. The PR will allow
> us to read the code, but I think we should do the first round of questions
> at the meeting.  The meeting could perhaps start with a presentation of the
> paper (do you have some slides you are planning to present at VLDB,
> Botong?) and then move on to questions about the concepts, which
> alternatives were considered, and how the concepts map onto other current
> and future concepts in calcite.
> >
> > I don’t think we should start “reviewing” the PR line-by-line at this
> point. We need to understand the high-level concepts and design choices. If
> we start reviewing the PR we will get lost in the details.
> >
> > I know that integrating a major change is hard; I doubt that we will be
> able to integrate everything, but we can build understanding about where
> calcite needs to go, and I hope integrate a good amount of code to help us
> get there.
> >
> > As I said before, after the integration I would like people to be able
> to experiment with it and use it in their production systems.  That way, it
> will not be an experiment that withers, but a feature set integrates with
> other calcite features and gets stronger over time.
> >
> > Julian
> >
> >> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote:
> >>
> >> For me to participate in the discussion for the above questions, I will
> >> need to read a lot more to know relevant context and likely ask lots of
> >> questions :-).  A editable doc is probably good for questions and back
> and
> >> forward discussion.
> >>
> >>
> >> -Rui
> >>
> >>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]>
> wrote:
> >>>
> >>> I am also happy to help push this work into Calcite (review code and
> doc,
> >>> etc.).
> >>>
> >>> While you can share your code so people can have more idea how it is
> >>> implemented, I think it would be also nice to have a doc to discuss
> open
> >>> questions above. Some points that I copy those to here:
> >>>
> >>> 1. Can this solution be compatible with existing solutions in Calcite
> >>> Streaming, materialized view maintenance, and multi-query optimization
> >>> (Sigma and Delta relational operators, lattice, and Spool operator),
> >>> 2. Did you find that you needed two separate cost models - one for
> “view
> >>> maintenance” and another for “user queries” - since the objectives of
> each
> >>> activity are so different?
> >>> 3. whether this work will hasten the arrival of multi-objective
> parametric
> >>> query optimization [1] in Calcite.
> >>> 4. probably SQL shell support.
> >>>
> >>>
> >>> [1]:
> >>>
> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
> >>>
> >>>
> >>> -Rui
> >>>
> >>>
> >>>
> >>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote:
> >>>>
> >>>> it would be very nice to see a POC of your work.
> >>>>
> >>>>
> >>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]>
> wrote:
> >>>>
> >>>>> Hi Julian,
> >>>>>
> >>>>> Just wondering if there are any updates? We are wondering if it would
> >>>> help
> >>>>> to post our code for a quick preview.
> >>>>>
> >>>>> Thanks,
> >>>>> Botong
> >>>>>
> >>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]>
> wrote:
> >>>>>
> >>>>>> Hi Julian,
> >>>>>>
> >>>>>> Thanks for your interest! Sure let's figure out a plan that best
> >>>> benefits
> >>>>>> the community. Here are some clarifications that hopefully answer
> your
> >>>>>> questions.
> >>>>>>
> >>>>>> In our work (Tempura), users specify the set of time points to
> >>>> consider
> >>>>>> running and a cost function that expresses users' preference over
> >>>> time,
> >>>>>> Tempura will generate the best incremental plan that minimizes the
> >>>>> overall
> >>>>>> cost function.
> >>>>>>
> >>>>>> In this incremental plan, the sub-plans at different time points can
> >>>> be
> >>>>>> different from each other, as opposed to identical plans in all
> delta
> >>>>> runs
> >>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper,
> we
> >>>> can
> >>>>>> mimic the current streaming implementation by specifying two
> (logical)
> >>>>> time
> >>>>>> points in Tempura, representing the initial run and later delta runs
> >>>>>> respectively. In general, note that Tempura supports various form of
> >>>>>> incremental computing, not only the small-delta append-only data
> >>>> model in
> >>>>>> streaming systems. That's why we believe Tempura subsumes the
> current
> >>>>>> streaming support, as well as any IVM implementations.
> >>>>>>
> >>>>>> About the cost model, we did not come up with a seperate cost model,
> >>>> but
> >>>>>> rather extended the existing one. Similar to multi-objective
> >>>>> optimization,
> >>>>>> costs incurred at different time points are considered different
> >>>>>> dimensions. Tempura lets users supply a function that converts this
> >>>> cost
> >>>>>> vector into a final cost. So under this function, any two
> incremental
> >>>>> plans
> >>>>>> are still comparable and there is an overall optimum. I guess we can
> >>>> go
> >>>>>> down the route of multi-objective parametric query optimization
> >>>> instead
> >>>>> if
> >>>>>> there is a need.
> >>>>>>
> >>>>>> Next on materialized views and multi-query optimization, since our
> >>>>>> multi-time-point plan naturally involves materializing intermediate
> >>>>> results
> >>>>>> for later time points, we need to solve the problem of choosing
> >>>>>> materializations and include the cost of saving and reusing the
> >>>>>> materializations when costing and comparing plans. We borrowed the
> >>>>>> multi-query optimization techniques to solve this problem even
> though
> >>>> we
> >>>>>> are looking at a single query. As a result, we think our work is
> >>>>> orthogonal
> >>>>>> to Calcite's facilities around utilizing existing views, lattice
> etc.
> >>>> We
> >>>>> do
> >>>>>> feel that the multi-query optimization component can be adopted to
> >>>> wider
> >>>>>> use, but probably need more suggestions from the community.
> >>>>>>
> >>>>>> Lastly, our current implementation is set up in java code, it should
> >>>> be
> >>>>>> straightforward to hook it up with SQL shell.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Botong
> >>>>>>
> >>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]
> >
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Botong,
> >>>>>>>
> >>>>>>> This is very exciting; congratulations on this research, and thank
> >>>> you
> >>>>>>> for contributing it back to Calcite.
> >>>>>>>
> >>>>>>> The research touches several areas in Calcite: streaming,
> >>>> materialized
> >>>>>>> view maintenance, and multi-query optimization. As we have already
> >>>> some
> >>>>>>> solutions in those areas (Sigma and Delta relational operators,
> >>>> lattice,
> >>>>>>> and Spool operator), it will be interesting to see whether we can
> >>>> make
> >>>>> them
> >>>>>>> compatible, or whether one concept can subsume others.
> >>>>>>>
> >>>>>>> Your work differs from streaming queries in that your relations are
> >>>> used
> >>>>>>> by “external” user queries, whereas in pure streaming queries, the
> >>>> only
> >>>>>>> activity is the change propagation. Did you find that you needed
> two
> >>>>>>> separate cost models - one for “view maintenance” and another for
> >>>> “user
> >>>>>>> queries” - since the objectives of each activity are so different?
> >>>>>>>
> >>>>>>> I wonder whether this work will hasten the arrival of
> multi-objective
> >>>>>>> parametric query optimization [1] in Calcite.
> >>>>>>>
> >>>>>>> I will make time over the next few days to read and digest your
> >>>> paper.
> >>>>>>> Then I expect that we will have a back-and-forth process to create
> >>>>>>> something that will be useful for the broader community.
> >>>>>>>
> >>>>>>> One thing will be particularly useful: making this functionality
> >>>>>>> available from a SQL shell, so that people can experiment with this
> >>>>>>> functionality without writing Java code or setting up complex
> >>>> databases
> >>>>> and
> >>>>>>> metadata. I have in mind something like the simple DDL operations
> >>>> that
> >>>>> are
> >>>>>>> available in Calcite’s ’server’ module. I wonder whether we could
> >>>> devise
> >>>>>>> some kind of SQL syntax for a “multi-query”.
> >>>>>>>
> >>>>>>> Julian
> >>>>>>>
> >>>>>>> [1]
> >>>>>>>
> >>>>>
> >>>>
> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>> Thanks Aron for pointing this out. To see the figure, please refer
> >>>> to
> >>>>>>> Fig
> >>>>>>>> 3(a) in our paper:
> >>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Botong
> >>>>>>>>
> >>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you
> >>>> open
> >>>>> a
> >>>>>>> JIRA
> >>>>>>>>> for this, people who are interested in this can subscribe to the
> >>>>> JIRA?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Regards!
> >>>>>>>>>
> >>>>>>>>> Aron Tao
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
> >>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> This is a proposal to extend the Calcite optimizer into a
> general
> >>>>>>>>>> incremental query optimizer, based on our research paper
> >>>> published
> >>>>> in
> >>>>>>>>> VLDB
> >>>>>>>>>> 2021:
> >>>>>>>>>> Tempura: a general cost-based optimizer framework for
> incremental
> >>>>> data
> >>>>>>>>>> processing
> >>>>>>>>>>
> >>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s
> >>>> data
> >>>>>>>>>> warehouse is planning to use this incremental query optimizer to
> >>>>>>>>> alleviate
> >>>>>>>>>> cluster-wise resource skewness:
> >>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental
> >>>>>>>>> Computing
> >>>>>>>>>>
> >>>>>>>>>> To our best knowledge, this is the first general cost-based
> >>>>>>> incremental
> >>>>>>>>>> optimizer that can find the best plan across multiple families
> of
> >>>>>>>>>> incremental computing methods, including IVM, Streaming,
> >>>> DBToaster,
> >>>>>>> etc.
> >>>>>>>>>> Experiments (in the paper) shows that the generated best plan is
> >>>>>>>>>> consistently much better than the plans from each individual
> >>>> method
> >>>>>>>>> alone.
> >>>>>>>>>>
> >>>>>>>>>> In general, incremental query planning is central to database
> >>>> view
> >>>>>>>>>> maintenance and stream processing systems, and are being adopted
> >>>> in
> >>>>>>>>> active
> >>>>>>>>>> databases, resumable query execution, approximate query
> >>>> processing,
> >>>>>>> etc.
> >>>>>>>>> We
> >>>>>>>>>> are hoping that this feature can help widening the spectrum of
> >>>>>>> Calcite,
> >>>>>>>>>> solicit more use cases and adoption of Calcite.
> >>>>>>>>>>
> >>>>>>>>>> Below is a brief description of the technical details. Please
> >>>> refer
> >>>>> to
> >>>>>>>>> the
> >>>>>>>>>> Tempura paper for more details. We are also working on a journal
> >>>>>>> version
> >>>>>>>>> of
> >>>>>>>>>> the paper with more implementation details.
> >>>>>>>>>>
> >>>>>>>>>> Currently the query plan generated by Calcite is meant to be
> >>>>> executed
> >>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be
> >>>> extended
> >>>>>>> with
> >>>>>>>>>> temporal information so that it is capable of generating
> >>>> incremental
> >>>>>>>>> plans
> >>>>>>>>>> that include multiple sub-plans to execute at different time
> >>>> points.
> >>>>>>>>>>
> >>>>>>>>>> The main idea is to view each table as one that changes over
> time
> >>>>>>> (Time
> >>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced
> >>>> TvrMetaSet
> >>>>>>> into
> >>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related
> >>>> RelSets
> >>>>>>> of a
> >>>>>>>>>> changing table (e.g. snapshot of the table at certain time,
> >>>> delta of
> >>>>>>> the
> >>>>>>>>>> table between two time points, etc.).
> >>>>>>>>>>
> >>>>>>>>>> [image: image.png]
> >>>>>>>>>>
> >>>>>>>>>> For example in the above figure, each vertical line is a
> >>>> TvrMetaSet
> >>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). Horizontal
> >>>>> lines
> >>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users
> can
> >>>>>>> write
> >>>>>>>>> TVR
> >>>>>>>>>> Rewrite Rules to describe valid transformations between these
> >>>> dots.
> >>>>>>> For
> >>>>>>>>>> example, the blues lines are inter-TVR rules that describe how
> to
> >>>>>>> compute
> >>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red
> lines
> >>>>> are
> >>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All
> >>>> TVR
> >>>>>>>>> rewrite
> >>>>>>>>>> rules are logical rules. All existing Calcite rules still work
> in
> >>>>> the
> >>>>>>> new
> >>>>>>>>>> volcano system without modification.
> >>>>>>>>>>
> >>>>>>>>>> All changes in this feature will consist of four parts:
> >>>>>>>>>> 1. Memo extension with TvrMetaSet
> >>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and
> >>>> RelNodes,
> >>>>>>> as
> >>>>>>>>>> well as links in between the nodes.
> >>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule
> >>>> engine
> >>>>>>> API.
> >>>>>>>>>> 4. Multi-query optimization, used to find the best incremental
> >>>> plan
> >>>>>>>>>> involving multiple time points.
> >>>>>>>>>>
> >>>>>>>>>> Note that this feature is an extension in nature and thus when
> >>>>>>> disabled,
> >>>>>>>>>> does not change any existing Calcite behavior.
> >>>>>>>>>>
> >>>>>>>>>> Other than scenarios in the paper, we also applied this
> >>>>>>> Calcite-extended
> >>>>>>>>>> incremental query optimizer to a type of periodic query called
> >>>> the
> >>>>>>>>> ‘‘range
> >>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings of
> >>>> 80%
> >>>>>>> on
> >>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end
> execution
> >>>>>>> time.
> >>>>>>>>>>
> >>>>>>>>>> All comments and suggestions are welcome. Thanks and happy
> >>>> holidays!
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Botong
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> ~~~~~~~~~~~~~~~
> >>>> no mistakes
> >>>> ~~~~~~~~~~~~~~~~~~
> >>>>
> >>>
>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to