Re: Proposal to extend Calcite into a incremental query optimizer

Julian Hyde Mon, 05 Apr 2021 15:59:33 -0700

I have added my time preferences to the doc.

Before we meet, could you publish a PR for us to review?


Initial discussions will need to be about architecture and high-level design. 
So I would ask Calcite reviewers not to review the PR line-by-line (or to leave 
comments in GitHub) but try to understand the design holistically, and prepare 
questions/comments before the meeting.

Botong, Can you please create a Calcite JIRA case for this task? JIRA how we 
track long-running tasks such as this.

Julian


> On Apr 3, 2021, at 5:15 PM, Botong Huang <[email protected]> wrote:
> 
> Hi all,
> 
> Apology for the delay. It took us some time to clean up our code base and
> publicly release it (which will be out soon) for a quick peek.
> 
> We are ready to present our work. Let's schedule a time for a Zoom
> meeting and discuss how to integrate Tempura into Calcite.
> 
> Since some of our team members are in China, we prefer the time slot of
> 7:00pm-11:30pm PST any day. I've added our time preference in the shared
> doc below.
> https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing
> 
> We encourage everyone to add their time preferences (during 04/15-04/30) in
> this doc. In a week or so, we will try to settle a time that works for
> most.
> 
> Thanks,
> Botong
> 
> On Sat, Jan 30, 2021 at 9:19 PM Botong Huang <[email protected]> wrote:
> 
>> Hi Julian and Rui,
>> 
>> Sounds good to us. Please give us some time to prepare some slides for the
>> meeting.
>> 
>> I've created a doc below for discussion. Please feel free to add more in
>> here:
>> 
>> https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing
>> 
>> Thanks,
>> Botong
>> 
>> On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]>
>> wrote:
>> 
>>> PS The “editable doc” that Rui refers to is also a good idea. I think we
>>> should create it to continue discussion after the first meeting.
>>> 
>>> Julian
>>> 
>>>> On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]>
>>> wrote:
>>>> 
>>>> I think good next steps would be a PR and a meeting. The PR will allow
>>> us to read the code, but I think we should do the first round of questions
>>> at the meeting.  The meeting could perhaps start with a presentation of the
>>> paper (do you have some slides you are planning to present at VLDB,
>>> Botong?) and then move on to questions about the concepts, which
>>> alternatives were considered, and how the concepts map onto other current
>>> and future concepts in calcite.
>>>> 
>>>> I don’t think we should start “reviewing” the PR line-by-line at this
>>> point. We need to understand the high-level concepts and design choices. If
>>> we start reviewing the PR we will get lost in the details.
>>>> 
>>>> I know that integrating a major change is hard; I doubt that we will be
>>> able to integrate everything, but we can build understanding about where
>>> calcite needs to go, and I hope integrate a good amount of code to help us
>>> get there.
>>>> 
>>>> As I said before, after the integration I would like people to be able
>>> to experiment with it and use it in their production systems.  That way, it
>>> will not be an experiment that withers, but a feature set integrates with
>>> other calcite features and gets stronger over time.
>>>> 
>>>> Julian
>>>> 
>>>>> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote:
>>>>> 
>>>>> For me to participate in the discussion for the above questions, I
>>> will
>>>>> need to read a lot more to know relevant context and likely ask lots of
>>>>> questions :-).  A editable doc is probably good for questions and back
>>> and
>>>>> forward discussion.
>>>>> 
>>>>> 
>>>>> -Rui
>>>>> 
>>>>>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]>
>>> wrote:
>>>>>> 
>>>>>> I am also happy to help push this work into Calcite (review code and
>>> doc,
>>>>>> etc.).
>>>>>> 
>>>>>> While you can share your code so people can have more idea how it is
>>>>>> implemented, I think it would be also nice to have a doc to discuss
>>> open
>>>>>> questions above. Some points that I copy those to here:
>>>>>> 
>>>>>> 1. Can this solution be compatible with existing solutions in Calcite
>>>>>> Streaming, materialized view maintenance, and multi-query optimization
>>>>>> (Sigma and Delta relational operators, lattice, and Spool operator),
>>>>>> 2. Did you find that you needed two separate cost models - one for
>>> “view
>>>>>> maintenance” and another for “user queries” - since the objectives of
>>> each
>>>>>> activity are so different?
>>>>>> 3. whether this work will hasten the arrival of multi-objective
>>> parametric
>>>>>> query optimization [1] in Calcite.
>>>>>> 4. probably SQL shell support.
>>>>>> 
>>>>>> 
>>>>>> [1]:
>>>>>> 
>>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>>>>>> 
>>>>>> 
>>>>>> -Rui
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote:
>>>>>>> 
>>>>>>> it would be very nice to see a POC of your work.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]>
>>> wrote:
>>>>>>> 
>>>>>>>> Hi Julian,
>>>>>>>> 
>>>>>>>> Just wondering if there are any updates? We are wondering if it
>>> would
>>>>>>> help
>>>>>>>> to post our code for a quick preview.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Botong
>>>>>>>> 
>>>>>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]>
>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Julian,
>>>>>>>>> 
>>>>>>>>> Thanks for your interest! Sure let's figure out a plan that best
>>>>>>> benefits
>>>>>>>>> the community. Here are some clarifications that hopefully answer
>>> your
>>>>>>>>> questions.
>>>>>>>>> 
>>>>>>>>> In our work (Tempura), users specify the set of time points to
>>>>>>> consider
>>>>>>>>> running and a cost function that expresses users' preference over
>>>>>>> time,
>>>>>>>>> Tempura will generate the best incremental plan that minimizes the
>>>>>>>> overall
>>>>>>>>> cost function.
>>>>>>>>> 
>>>>>>>>> In this incremental plan, the sub-plans at different time points
>>> can
>>>>>>> be
>>>>>>>>> different from each other, as opposed to identical plans in all
>>> delta
>>>>>>>> runs
>>>>>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper,
>>> we
>>>>>>> can
>>>>>>>>> mimic the current streaming implementation by specifying two
>>> (logical)
>>>>>>>> time
>>>>>>>>> points in Tempura, representing the initial run and later delta
>>> runs
>>>>>>>>> respectively. In general, note that Tempura supports various form
>>> of
>>>>>>>>> incremental computing, not only the small-delta append-only data
>>>>>>> model in
>>>>>>>>> streaming systems. That's why we believe Tempura subsumes the
>>> current
>>>>>>>>> streaming support, as well as any IVM implementations.
>>>>>>>>> 
>>>>>>>>> About the cost model, we did not come up with a seperate cost
>>> model,
>>>>>>> but
>>>>>>>>> rather extended the existing one. Similar to multi-objective
>>>>>>>> optimization,
>>>>>>>>> costs incurred at different time points are considered different
>>>>>>>>> dimensions. Tempura lets users supply a function that converts this
>>>>>>> cost
>>>>>>>>> vector into a final cost. So under this function, any two
>>> incremental
>>>>>>>> plans
>>>>>>>>> are still comparable and there is an overall optimum. I guess we
>>> can
>>>>>>> go
>>>>>>>>> down the route of multi-objective parametric query optimization
>>>>>>> instead
>>>>>>>> if
>>>>>>>>> there is a need.
>>>>>>>>> 
>>>>>>>>> Next on materialized views and multi-query optimization, since our
>>>>>>>>> multi-time-point plan naturally involves materializing intermediate
>>>>>>>> results
>>>>>>>>> for later time points, we need to solve the problem of choosing
>>>>>>>>> materializations and include the cost of saving and reusing the
>>>>>>>>> materializations when costing and comparing plans. We borrowed the
>>>>>>>>> multi-query optimization techniques to solve this problem even
>>> though
>>>>>>> we
>>>>>>>>> are looking at a single query. As a result, we think our work is
>>>>>>>> orthogonal
>>>>>>>>> to Calcite's facilities around utilizing existing views, lattice
>>> etc.
>>>>>>> We
>>>>>>>> do
>>>>>>>>> feel that the multi-query optimization component can be adopted to
>>>>>>> wider
>>>>>>>>> use, but probably need more suggestions from the community.
>>>>>>>>> 
>>>>>>>>> Lastly, our current implementation is set up in java code, it
>>> should
>>>>>>> be
>>>>>>>>> straightforward to hook it up with SQL shell.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Botong
>>>>>>>>> 
>>>>>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <
>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Botong,
>>>>>>>>>> 
>>>>>>>>>> This is very exciting; congratulations on this research, and thank
>>>>>>> you
>>>>>>>>>> for contributing it back to Calcite.
>>>>>>>>>> 
>>>>>>>>>> The research touches several areas in Calcite: streaming,
>>>>>>> materialized
>>>>>>>>>> view maintenance, and multi-query optimization. As we have already
>>>>>>> some
>>>>>>>>>> solutions in those areas (Sigma and Delta relational operators,
>>>>>>> lattice,
>>>>>>>>>> and Spool operator), it will be interesting to see whether we can
>>>>>>> make
>>>>>>>> them
>>>>>>>>>> compatible, or whether one concept can subsume others.
>>>>>>>>>> 
>>>>>>>>>> Your work differs from streaming queries in that your relations
>>> are
>>>>>>> used
>>>>>>>>>> by “external” user queries, whereas in pure streaming queries, the
>>>>>>> only
>>>>>>>>>> activity is the change propagation. Did you find that you needed
>>> two
>>>>>>>>>> separate cost models - one for “view maintenance” and another for
>>>>>>> “user
>>>>>>>>>> queries” - since the objectives of each activity are so different?
>>>>>>>>>> 
>>>>>>>>>> I wonder whether this work will hasten the arrival of
>>> multi-objective
>>>>>>>>>> parametric query optimization [1] in Calcite.
>>>>>>>>>> 
>>>>>>>>>> I will make time over the next few days to read and digest your
>>>>>>> paper.
>>>>>>>>>> Then I expect that we will have a back-and-forth process to create
>>>>>>>>>> something that will be useful for the broader community.
>>>>>>>>>> 
>>>>>>>>>> One thing will be particularly useful: making this functionality
>>>>>>>>>> available from a SQL shell, so that people can experiment with
>>> this
>>>>>>>>>> functionality without writing Java code or setting up complex
>>>>>>> databases
>>>>>>>> and
>>>>>>>>>> metadata. I have in mind something like the simple DDL operations
>>>>>>> that
>>>>>>>> are
>>>>>>>>>> available in Calcite’s ’server’ module. I wonder whether we could
>>>>>>> devise
>>>>>>>>>> some kind of SQL syntax for a “multi-query”.
>>>>>>>>>> 
>>>>>>>>>> Julian
>>>>>>>>>> 
>>>>>>>>>> [1]
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]>
>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Thanks Aron for pointing this out. To see the figure, please
>>> refer
>>>>>>> to
>>>>>>>>>> Fig
>>>>>>>>>>> 3(a) in our paper:
>>>>>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Botong
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]>
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you
>>>>>>> open
>>>>>>>> a
>>>>>>>>>> JIRA
>>>>>>>>>>>> for this, people who are interested in this can subscribe to the
>>>>>>>> JIRA?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards!
>>>>>>>>>>>> 
>>>>>>>>>>>> Aron Tao
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This is a proposal to extend the Calcite optimizer into a
>>> general
>>>>>>>>>>>>> incremental query optimizer, based on our research paper
>>>>>>> published
>>>>>>>> in
>>>>>>>>>>>> VLDB
>>>>>>>>>>>>> 2021:
>>>>>>>>>>>>> Tempura: a general cost-based optimizer framework for
>>> incremental
>>>>>>>> data
>>>>>>>>>>>>> processing
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s
>>>>>>> data
>>>>>>>>>>>>> warehouse is planning to use this incremental query optimizer
>>> to
>>>>>>>>>>>> alleviate
>>>>>>>>>>>>> cluster-wise resource skewness:
>>>>>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware
>>> Incremental
>>>>>>>>>>>> Computing
>>>>>>>>>>>>> 
>>>>>>>>>>>>> To our best knowledge, this is the first general cost-based
>>>>>>>>>> incremental
>>>>>>>>>>>>> optimizer that can find the best plan across multiple families
>>> of
>>>>>>>>>>>>> incremental computing methods, including IVM, Streaming,
>>>>>>> DBToaster,
>>>>>>>>>> etc.
>>>>>>>>>>>>> Experiments (in the paper) shows that the generated best plan
>>> is
>>>>>>>>>>>>> consistently much better than the plans from each individual
>>>>>>> method
>>>>>>>>>>>> alone.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In general, incremental query planning is central to database
>>>>>>> view
>>>>>>>>>>>>> maintenance and stream processing systems, and are being
>>> adopted
>>>>>>> in
>>>>>>>>>>>> active
>>>>>>>>>>>>> databases, resumable query execution, approximate query
>>>>>>> processing,
>>>>>>>>>> etc.
>>>>>>>>>>>> We
>>>>>>>>>>>>> are hoping that this feature can help widening the spectrum of
>>>>>>>>>> Calcite,
>>>>>>>>>>>>> solicit more use cases and adoption of Calcite.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Below is a brief description of the technical details. Please
>>>>>>> refer
>>>>>>>> to
>>>>>>>>>>>> the
>>>>>>>>>>>>> Tempura paper for more details. We are also working on a
>>> journal
>>>>>>>>>> version
>>>>>>>>>>>> of
>>>>>>>>>>>>> the paper with more implementation details.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Currently the query plan generated by Calcite is meant to be
>>>>>>>> executed
>>>>>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be
>>>>>>> extended
>>>>>>>>>> with
>>>>>>>>>>>>> temporal information so that it is capable of generating
>>>>>>> incremental
>>>>>>>>>>>> plans
>>>>>>>>>>>>> that include multiple sub-plans to execute at different time
>>>>>>> points.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The main idea is to view each table as one that changes over
>>> time
>>>>>>>>>> (Time
>>>>>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced
>>>>>>> TvrMetaSet
>>>>>>>>>> into
>>>>>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related
>>>>>>> RelSets
>>>>>>>>>> of a
>>>>>>>>>>>>> changing table (e.g. snapshot of the table at certain time,
>>>>>>> delta of
>>>>>>>>>> the
>>>>>>>>>>>>> table between two time points, etc.).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For example in the above figure, each vertical line is a
>>>>>>> TvrMetaSet
>>>>>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.).
>>> Horizontal
>>>>>>>> lines
>>>>>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users
>>> can
>>>>>>>>>> write
>>>>>>>>>>>> TVR
>>>>>>>>>>>>> Rewrite Rules to describe valid transformations between these
>>>>>>> dots.
>>>>>>>>>> For
>>>>>>>>>>>>> example, the blues lines are inter-TVR rules that describe how
>>> to
>>>>>>>>>> compute
>>>>>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red
>>> lines
>>>>>>>> are
>>>>>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All
>>>>>>> TVR
>>>>>>>>>>>> rewrite
>>>>>>>>>>>>> rules are logical rules. All existing Calcite rules still work
>>> in
>>>>>>>> the
>>>>>>>>>> new
>>>>>>>>>>>>> volcano system without modification.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> All changes in this feature will consist of four parts:
>>>>>>>>>>>>> 1. Memo extension with TvrMetaSet
>>>>>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and
>>>>>>> RelNodes,
>>>>>>>>>> as
>>>>>>>>>>>>> well as links in between the nodes.
>>>>>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule
>>>>>>> engine
>>>>>>>>>> API.
>>>>>>>>>>>>> 4. Multi-query optimization, used to find the best incremental
>>>>>>> plan
>>>>>>>>>>>>> involving multiple time points.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Note that this feature is an extension in nature and thus when
>>>>>>>>>> disabled,
>>>>>>>>>>>>> does not change any existing Calcite behavior.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Other than scenarios in the paper, we also applied this
>>>>>>>>>> Calcite-extended
>>>>>>>>>>>>> incremental query optimizer to a type of periodic query called
>>>>>>> the
>>>>>>>>>>>> ‘‘range
>>>>>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings
>>> of
>>>>>>> 80%
>>>>>>>>>> on
>>>>>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end
>>> execution
>>>>>>>>>> time.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> All comments and suggestions are welcome. Thanks and happy
>>>>>>> holidays!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Botong
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> ~~~~~~~~~~~~~~~
>>>>>>> no mistakes
>>>>>>> ~~~~~~~~~~~~~~~~~~
>>>>>>> 
>>>>>> 
>>> 
>>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to