Re: Proposal to extend Calcite into a incremental query optimizer

Botong Huang Sat, 03 Apr 2021 17:16:40 -0700

Hi all,

Apology for the delay. It took us some time to clean up our code base and
publicly release it (which will be out soon) for a quick peek.


We are ready to present our work. Let's schedule a time for a Zoom
meeting and discuss how to integrate Tempura into Calcite.

Since some of our team members are in China, we prefer the time slot of
7:00pm-11:30pm PST any day. I've added our time preference in the shared
doc below.
https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing

We encourage everyone to add their time preferences (during 04/15-04/30) in
this doc. In a week or so, we will try to settle a time that works for
most.

Thanks,
Botong

On Sat, Jan 30, 2021 at 9:19 PM Botong Huang <[email protected]> wrote:

> Hi Julian and Rui,
>
> Sounds good to us. Please give us some time to prepare some slides for the
> meeting.
>
> I've created a doc below for discussion. Please feel free to add more in
> here:
>
> https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing
>
> Thanks,
> Botong
>
> On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]>
> wrote:
>
>> PS The “editable doc” that Rui refers to is also a good idea. I think we
>> should create it to continue discussion after the first meeting.
>>
>> Julian
>>
>> > On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]>
>> wrote:
>> >
>> > I think good next steps would be a PR and a meeting. The PR will allow
>> us to read the code, but I think we should do the first round of questions
>> at the meeting.  The meeting could perhaps start with a presentation of the
>> paper (do you have some slides you are planning to present at VLDB,
>> Botong?) and then move on to questions about the concepts, which
>> alternatives were considered, and how the concepts map onto other current
>> and future concepts in calcite.
>> >
>> > I don’t think we should start “reviewing” the PR line-by-line at this
>> point. We need to understand the high-level concepts and design choices. If
>> we start reviewing the PR we will get lost in the details.
>> >
>> > I know that integrating a major change is hard; I doubt that we will be
>> able to integrate everything, but we can build understanding about where
>> calcite needs to go, and I hope integrate a good amount of code to help us
>> get there.
>> >
>> > As I said before, after the integration I would like people to be able
>> to experiment with it and use it in their production systems.  That way, it
>> will not be an experiment that withers, but a feature set integrates with
>> other calcite features and gets stronger over time.
>> >
>> > Julian
>> >
>> >> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote:
>> >>
>> >> For me to participate in the discussion for the above questions, I
>> will
>> >> need to read a lot more to know relevant context and likely ask lots of
>> >> questions :-).  A editable doc is probably good for questions and back
>> and
>> >> forward discussion.
>> >>
>> >>
>> >> -Rui
>> >>
>> >>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]>
>> wrote:
>> >>>
>> >>> I am also happy to help push this work into Calcite (review code and
>> doc,
>> >>> etc.).
>> >>>
>> >>> While you can share your code so people can have more idea how it is
>> >>> implemented, I think it would be also nice to have a doc to discuss
>> open
>> >>> questions above. Some points that I copy those to here:
>> >>>
>> >>> 1. Can this solution be compatible with existing solutions in Calcite
>> >>> Streaming, materialized view maintenance, and multi-query optimization
>> >>> (Sigma and Delta relational operators, lattice, and Spool operator),
>> >>> 2. Did you find that you needed two separate cost models - one for
>> “view
>> >>> maintenance” and another for “user queries” - since the objectives of
>> each
>> >>> activity are so different?
>> >>> 3. whether this work will hasten the arrival of multi-objective
>> parametric
>> >>> query optimization [1] in Calcite.
>> >>> 4. probably SQL shell support.
>> >>>
>> >>>
>> >>> [1]:
>> >>>
>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>> >>>
>> >>>
>> >>> -Rui
>> >>>
>> >>>
>> >>>
>> >>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote:
>> >>>>
>> >>>> it would be very nice to see a POC of your work.
>> >>>>
>> >>>>
>> >>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]>
>> wrote:
>> >>>>
>> >>>>> Hi Julian,
>> >>>>>
>> >>>>> Just wondering if there are any updates? We are wondering if it
>> would
>> >>>> help
>> >>>>> to post our code for a quick preview.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Botong
>> >>>>>
>> >>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]>
>> wrote:
>> >>>>>
>> >>>>>> Hi Julian,
>> >>>>>>
>> >>>>>> Thanks for your interest! Sure let's figure out a plan that best
>> >>>> benefits
>> >>>>>> the community. Here are some clarifications that hopefully answer
>> your
>> >>>>>> questions.
>> >>>>>>
>> >>>>>> In our work (Tempura), users specify the set of time points to
>> >>>> consider
>> >>>>>> running and a cost function that expresses users' preference over
>> >>>> time,
>> >>>>>> Tempura will generate the best incremental plan that minimizes the
>> >>>>> overall
>> >>>>>> cost function.
>> >>>>>>
>> >>>>>> In this incremental plan, the sub-plans at different time points
>> can
>> >>>> be
>> >>>>>> different from each other, as opposed to identical plans in all
>> delta
>> >>>>> runs
>> >>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper,
>> we
>> >>>> can
>> >>>>>> mimic the current streaming implementation by specifying two
>> (logical)
>> >>>>> time
>> >>>>>> points in Tempura, representing the initial run and later delta
>> runs
>> >>>>>> respectively. In general, note that Tempura supports various form
>> of
>> >>>>>> incremental computing, not only the small-delta append-only data
>> >>>> model in
>> >>>>>> streaming systems. That's why we believe Tempura subsumes the
>> current
>> >>>>>> streaming support, as well as any IVM implementations.
>> >>>>>>
>> >>>>>> About the cost model, we did not come up with a seperate cost
>> model,
>> >>>> but
>> >>>>>> rather extended the existing one. Similar to multi-objective
>> >>>>> optimization,
>> >>>>>> costs incurred at different time points are considered different
>> >>>>>> dimensions. Tempura lets users supply a function that converts this
>> >>>> cost
>> >>>>>> vector into a final cost. So under this function, any two
>> incremental
>> >>>>> plans
>> >>>>>> are still comparable and there is an overall optimum. I guess we
>> can
>> >>>> go
>> >>>>>> down the route of multi-objective parametric query optimization
>> >>>> instead
>> >>>>> if
>> >>>>>> there is a need.
>> >>>>>>
>> >>>>>> Next on materialized views and multi-query optimization, since our
>> >>>>>> multi-time-point plan naturally involves materializing intermediate
>> >>>>> results
>> >>>>>> for later time points, we need to solve the problem of choosing
>> >>>>>> materializations and include the cost of saving and reusing the
>> >>>>>> materializations when costing and comparing plans. We borrowed the
>> >>>>>> multi-query optimization techniques to solve this problem even
>> though
>> >>>> we
>> >>>>>> are looking at a single query. As a result, we think our work is
>> >>>>> orthogonal
>> >>>>>> to Calcite's facilities around utilizing existing views, lattice
>> etc.
>> >>>> We
>> >>>>> do
>> >>>>>> feel that the multi-query optimization component can be adopted to
>> >>>> wider
>> >>>>>> use, but probably need more suggestions from the community.
>> >>>>>>
>> >>>>>> Lastly, our current implementation is set up in java code, it
>> should
>> >>>> be
>> >>>>>> straightforward to hook it up with SQL shell.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Botong
>> >>>>>>
>> >>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <
>> [email protected]>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Botong,
>> >>>>>>>
>> >>>>>>> This is very exciting; congratulations on this research, and thank
>> >>>> you
>> >>>>>>> for contributing it back to Calcite.
>> >>>>>>>
>> >>>>>>> The research touches several areas in Calcite: streaming,
>> >>>> materialized
>> >>>>>>> view maintenance, and multi-query optimization. As we have already
>> >>>> some
>> >>>>>>> solutions in those areas (Sigma and Delta relational operators,
>> >>>> lattice,
>> >>>>>>> and Spool operator), it will be interesting to see whether we can
>> >>>> make
>> >>>>> them
>> >>>>>>> compatible, or whether one concept can subsume others.
>> >>>>>>>
>> >>>>>>> Your work differs from streaming queries in that your relations
>> are
>> >>>> used
>> >>>>>>> by “external” user queries, whereas in pure streaming queries, the
>> >>>> only
>> >>>>>>> activity is the change propagation. Did you find that you needed
>> two
>> >>>>>>> separate cost models - one for “view maintenance” and another for
>> >>>> “user
>> >>>>>>> queries” - since the objectives of each activity are so different?
>> >>>>>>>
>> >>>>>>> I wonder whether this work will hasten the arrival of
>> multi-objective
>> >>>>>>> parametric query optimization [1] in Calcite.
>> >>>>>>>
>> >>>>>>> I will make time over the next few days to read and digest your
>> >>>> paper.
>> >>>>>>> Then I expect that we will have a back-and-forth process to create
>> >>>>>>> something that will be useful for the broader community.
>> >>>>>>>
>> >>>>>>> One thing will be particularly useful: making this functionality
>> >>>>>>> available from a SQL shell, so that people can experiment with
>> this
>> >>>>>>> functionality without writing Java code or setting up complex
>> >>>> databases
>> >>>>> and
>> >>>>>>> metadata. I have in mind something like the simple DDL operations
>> >>>> that
>> >>>>> are
>> >>>>>>> available in Calcite’s ’server’ module. I wonder whether we could
>> >>>> devise
>> >>>>>>> some kind of SQL syntax for a “multi-query”.
>> >>>>>>>
>> >>>>>>> Julian
>> >>>>>>>
>> >>>>>>> [1]
>> >>>>>>>
>> >>>>>
>> >>>>
>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]>
>> >>>> wrote:
>> >>>>>>>>
>> >>>>>>>> Thanks Aron for pointing this out. To see the figure, please
>> refer
>> >>>> to
>> >>>>>>> Fig
>> >>>>>>>> 3(a) in our paper:
>> >>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> Botong
>> >>>>>>>>
>> >>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]>
>> >>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you
>> >>>> open
>> >>>>> a
>> >>>>>>> JIRA
>> >>>>>>>>> for this, people who are interested in this can subscribe to the
>> >>>>> JIRA?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Regards!
>> >>>>>>>>>
>> >>>>>>>>> Aron Tao
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
>> >>>>>>>>>
>> >>>>>>>>>> Hi all,
>> >>>>>>>>>>
>> >>>>>>>>>> This is a proposal to extend the Calcite optimizer into a
>> general
>> >>>>>>>>>> incremental query optimizer, based on our research paper
>> >>>> published
>> >>>>> in
>> >>>>>>>>> VLDB
>> >>>>>>>>>> 2021:
>> >>>>>>>>>> Tempura: a general cost-based optimizer framework for
>> incremental
>> >>>>> data
>> >>>>>>>>>> processing
>> >>>>>>>>>>
>> >>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s
>> >>>> data
>> >>>>>>>>>> warehouse is planning to use this incremental query optimizer
>> to
>> >>>>>>>>> alleviate
>> >>>>>>>>>> cluster-wise resource skewness:
>> >>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware
>> Incremental
>> >>>>>>>>> Computing
>> >>>>>>>>>>
>> >>>>>>>>>> To our best knowledge, this is the first general cost-based
>> >>>>>>> incremental
>> >>>>>>>>>> optimizer that can find the best plan across multiple families
>> of
>> >>>>>>>>>> incremental computing methods, including IVM, Streaming,
>> >>>> DBToaster,
>> >>>>>>> etc.
>> >>>>>>>>>> Experiments (in the paper) shows that the generated best plan
>> is
>> >>>>>>>>>> consistently much better than the plans from each individual
>> >>>> method
>> >>>>>>>>> alone.
>> >>>>>>>>>>
>> >>>>>>>>>> In general, incremental query planning is central to database
>> >>>> view
>> >>>>>>>>>> maintenance and stream processing systems, and are being
>> adopted
>> >>>> in
>> >>>>>>>>> active
>> >>>>>>>>>> databases, resumable query execution, approximate query
>> >>>> processing,
>> >>>>>>> etc.
>> >>>>>>>>> We
>> >>>>>>>>>> are hoping that this feature can help widening the spectrum of
>> >>>>>>> Calcite,
>> >>>>>>>>>> solicit more use cases and adoption of Calcite.
>> >>>>>>>>>>
>> >>>>>>>>>> Below is a brief description of the technical details. Please
>> >>>> refer
>> >>>>> to
>> >>>>>>>>> the
>> >>>>>>>>>> Tempura paper for more details. We are also working on a
>> journal
>> >>>>>>> version
>> >>>>>>>>> of
>> >>>>>>>>>> the paper with more implementation details.
>> >>>>>>>>>>
>> >>>>>>>>>> Currently the query plan generated by Calcite is meant to be
>> >>>>> executed
>> >>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be
>> >>>> extended
>> >>>>>>> with
>> >>>>>>>>>> temporal information so that it is capable of generating
>> >>>> incremental
>> >>>>>>>>> plans
>> >>>>>>>>>> that include multiple sub-plans to execute at different time
>> >>>> points.
>> >>>>>>>>>>
>> >>>>>>>>>> The main idea is to view each table as one that changes over
>> time
>> >>>>>>> (Time
>> >>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced
>> >>>> TvrMetaSet
>> >>>>>>> into
>> >>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related
>> >>>> RelSets
>> >>>>>>> of a
>> >>>>>>>>>> changing table (e.g. snapshot of the table at certain time,
>> >>>> delta of
>> >>>>>>> the
>> >>>>>>>>>> table between two time points, etc.).
>> >>>>>>>>>>
>> >>>>>>>>>> [image: image.png]
>> >>>>>>>>>>
>> >>>>>>>>>> For example in the above figure, each vertical line is a
>> >>>> TvrMetaSet
>> >>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.).
>> Horizontal
>> >>>>> lines
>> >>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users
>> can
>> >>>>>>> write
>> >>>>>>>>> TVR
>> >>>>>>>>>> Rewrite Rules to describe valid transformations between these
>> >>>> dots.
>> >>>>>>> For
>> >>>>>>>>>> example, the blues lines are inter-TVR rules that describe how
>> to
>> >>>>>>> compute
>> >>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red
>> lines
>> >>>>> are
>> >>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All
>> >>>> TVR
>> >>>>>>>>> rewrite
>> >>>>>>>>>> rules are logical rules. All existing Calcite rules still work
>> in
>> >>>>> the
>> >>>>>>> new
>> >>>>>>>>>> volcano system without modification.
>> >>>>>>>>>>
>> >>>>>>>>>> All changes in this feature will consist of four parts:
>> >>>>>>>>>> 1. Memo extension with TvrMetaSet
>> >>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and
>> >>>> RelNodes,
>> >>>>>>> as
>> >>>>>>>>>> well as links in between the nodes.
>> >>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule
>> >>>> engine
>> >>>>>>> API.
>> >>>>>>>>>> 4. Multi-query optimization, used to find the best incremental
>> >>>> plan
>> >>>>>>>>>> involving multiple time points.
>> >>>>>>>>>>
>> >>>>>>>>>> Note that this feature is an extension in nature and thus when
>> >>>>>>> disabled,
>> >>>>>>>>>> does not change any existing Calcite behavior.
>> >>>>>>>>>>
>> >>>>>>>>>> Other than scenarios in the paper, we also applied this
>> >>>>>>> Calcite-extended
>> >>>>>>>>>> incremental query optimizer to a type of periodic query called
>> >>>> the
>> >>>>>>>>> ‘‘range
>> >>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings
>> of
>> >>>> 80%
>> >>>>>>> on
>> >>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end
>> execution
>> >>>>>>> time.
>> >>>>>>>>>>
>> >>>>>>>>>> All comments and suggestions are welcome. Thanks and happy
>> >>>> holidays!
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Botong
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> ~~~~~~~~~~~~~~~
>> >>>> no mistakes
>> >>>> ~~~~~~~~~~~~~~~~~~
>> >>>>
>> >>>
>>
>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to