Re: Proposal to extend Calcite into a incremental query optimizer

Julian Hyde Thu, 28 Jan 2021 11:16:30 -0800

I think good next steps would be a PR and a meeting. The PR will allow us to 
read the code, but I think we should do the first round of questions at the 
meeting.  The meeting could perhaps start with a presentation of the paper (do 
you have some slides you are planning to present at VLDB, Botong?) and then 
move on to questions about the concepts, which alternatives were considered, 
and how the concepts map onto other current and future concepts in calcite.


I don’t think we should start “reviewing” the PR line-by-line at this point. We 
need to understand the high-level concepts and design choices. If we start 
reviewing the PR we will get lost in the details. 

I know that integrating a major change is hard; I doubt that we will be able to 
integrate everything, but we can build understanding about where calcite needs 
to go, and I hope integrate a good amount of code to help us get there.

As I said before, after the integration I would like people to be able to 
experiment with it and use it in their production systems.  That way, it will 
not be an experiment that withers, but a feature set integrates with other 
calcite features and gets stronger over time. 

Julian

> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote:
> 
> For me to participate in the discussion for the above questions, I will
> need to read a lot more to know relevant context and likely ask lots of
> questions :-).  A editable doc is probably good for questions and back and
> forward discussion.
> 
> 
> -Rui
> 
>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> wrote:
>> 
>> I am also happy to help push this work into Calcite (review code and doc,
>> etc.).
>> 
>> While you can share your code so people can have more idea how it is
>> implemented, I think it would be also nice to have a doc to discuss open
>> questions above. Some points that I copy those to here:
>> 
>> 1. Can this solution be compatible with existing solutions in Calcite
>> Streaming, materialized view maintenance, and multi-query optimization
>> (Sigma and Delta relational operators, lattice, and Spool operator),
>> 2. Did you find that you needed two separate cost models - one for “view
>> maintenance” and another for “user queries” - since the objectives of each
>> activity are so different?
>> 3. whether this work will hasten the arrival of multi-objective parametric
>> query optimization [1] in Calcite.
>> 4. probably SQL shell support.
>> 
>> 
>> [1]:
>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>> 
>> 
>> -Rui
>> 
>> 
>> 
>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote:
>>> 
>>> it would be very nice to see a POC of your work.
>>> 
>>> 
>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> wrote:
>>> 
>>>> Hi Julian,
>>>> 
>>>> Just wondering if there are any updates? We are wondering if it would
>>> help
>>>> to post our code for a quick preview.
>>>> 
>>>> Thanks,
>>>> Botong
>>>> 
>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> wrote:
>>>> 
>>>>> Hi Julian,
>>>>> 
>>>>> Thanks for your interest! Sure let's figure out a plan that best
>>> benefits
>>>>> the community. Here are some clarifications that hopefully answer your
>>>>> questions.
>>>>> 
>>>>> In our work (Tempura), users specify the set of time points to
>>> consider
>>>>> running and a cost function that expresses users' preference over
>>> time,
>>>>> Tempura will generate the best incremental plan that minimizes the
>>>> overall
>>>>> cost function.
>>>>> 
>>>>> In this incremental plan, the sub-plans at different time points can
>>> be
>>>>> different from each other, as opposed to identical plans in all delta
>>>> runs
>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we
>>> can
>>>>> mimic the current streaming implementation by specifying two (logical)
>>>> time
>>>>> points in Tempura, representing the initial run and later delta runs
>>>>> respectively. In general, note that Tempura supports various form of
>>>>> incremental computing, not only the small-delta append-only data
>>> model in
>>>>> streaming systems. That's why we believe Tempura subsumes the current
>>>>> streaming support, as well as any IVM implementations.
>>>>> 
>>>>> About the cost model, we did not come up with a seperate cost model,
>>> but
>>>>> rather extended the existing one. Similar to multi-objective
>>>> optimization,
>>>>> costs incurred at different time points are considered different
>>>>> dimensions. Tempura lets users supply a function that converts this
>>> cost
>>>>> vector into a final cost. So under this function, any two incremental
>>>> plans
>>>>> are still comparable and there is an overall optimum. I guess we can
>>> go
>>>>> down the route of multi-objective parametric query optimization
>>> instead
>>>> if
>>>>> there is a need.
>>>>> 
>>>>> Next on materialized views and multi-query optimization, since our
>>>>> multi-time-point plan naturally involves materializing intermediate
>>>> results
>>>>> for later time points, we need to solve the problem of choosing
>>>>> materializations and include the cost of saving and reusing the
>>>>> materializations when costing and comparing plans. We borrowed the
>>>>> multi-query optimization techniques to solve this problem even though
>>> we
>>>>> are looking at a single query. As a result, we think our work is
>>>> orthogonal
>>>>> to Calcite's facilities around utilizing existing views, lattice etc.
>>> We
>>>> do
>>>>> feel that the multi-query optimization component can be adopted to
>>> wider
>>>>> use, but probably need more suggestions from the community.
>>>>> 
>>>>> Lastly, our current implementation is set up in java code, it should
>>> be
>>>>> straightforward to hook it up with SQL shell.
>>>>> 
>>>>> Thanks,
>>>>> Botong
>>>>> 
>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Botong,
>>>>>> 
>>>>>> This is very exciting; congratulations on this research, and thank
>>> you
>>>>>> for contributing it back to Calcite.
>>>>>> 
>>>>>> The research touches several areas in Calcite: streaming,
>>> materialized
>>>>>> view maintenance, and multi-query optimization. As we have already
>>> some
>>>>>> solutions in those areas (Sigma and Delta relational operators,
>>> lattice,
>>>>>> and Spool operator), it will be interesting to see whether we can
>>> make
>>>> them
>>>>>> compatible, or whether one concept can subsume others.
>>>>>> 
>>>>>> Your work differs from streaming queries in that your relations are
>>> used
>>>>>> by “external” user queries, whereas in pure streaming queries, the
>>> only
>>>>>> activity is the change propagation. Did you find that you needed two
>>>>>> separate cost models - one for “view maintenance” and another for
>>> “user
>>>>>> queries” - since the objectives of each activity are so different?
>>>>>> 
>>>>>> I wonder whether this work will hasten the arrival of multi-objective
>>>>>> parametric query optimization [1] in Calcite.
>>>>>> 
>>>>>> I will make time over the next few days to read and digest your
>>> paper.
>>>>>> Then I expect that we will have a back-and-forth process to create
>>>>>> something that will be useful for the broader community.
>>>>>> 
>>>>>> One thing will be particularly useful: making this functionality
>>>>>> available from a SQL shell, so that people can experiment with this
>>>>>> functionality without writing Java code or setting up complex
>>> databases
>>>> and
>>>>>> metadata. I have in mind something like the simple DDL operations
>>> that
>>>> are
>>>>>> available in Calcite’s ’server’ module. I wonder whether we could
>>> devise
>>>>>> some kind of SQL syntax for a “multi-query”.
>>>>>> 
>>>>>> Julian
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>> 
>>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]>
>>> wrote:
>>>>>>> 
>>>>>>> Thanks Aron for pointing this out. To see the figure, please refer
>>> to
>>>>>> Fig
>>>>>>> 3(a) in our paper:
>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
>>>>>>> 
>>>>>>> Best,
>>>>>>> Botong
>>>>>>> 
>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]>
>>>> wrote:
>>>>>>> 
>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you
>>> open
>>>> a
>>>>>> JIRA
>>>>>>>> for this, people who are interested in this can subscribe to the
>>>> JIRA?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards!
>>>>>>>> 
>>>>>>>> Aron Tao
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> This is a proposal to extend the Calcite optimizer into a general
>>>>>>>>> incremental query optimizer, based on our research paper
>>> published
>>>> in
>>>>>>>> VLDB
>>>>>>>>> 2021:
>>>>>>>>> Tempura: a general cost-based optimizer framework for incremental
>>>> data
>>>>>>>>> processing
>>>>>>>>> 
>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s
>>> data
>>>>>>>>> warehouse is planning to use this incremental query optimizer to
>>>>>>>> alleviate
>>>>>>>>> cluster-wise resource skewness:
>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental
>>>>>>>> Computing
>>>>>>>>> 
>>>>>>>>> To our best knowledge, this is the first general cost-based
>>>>>> incremental
>>>>>>>>> optimizer that can find the best plan across multiple families of
>>>>>>>>> incremental computing methods, including IVM, Streaming,
>>> DBToaster,
>>>>>> etc.
>>>>>>>>> Experiments (in the paper) shows that the generated best plan is
>>>>>>>>> consistently much better than the plans from each individual
>>> method
>>>>>>>> alone.
>>>>>>>>> 
>>>>>>>>> In general, incremental query planning is central to database
>>> view
>>>>>>>>> maintenance and stream processing systems, and are being adopted
>>> in
>>>>>>>> active
>>>>>>>>> databases, resumable query execution, approximate query
>>> processing,
>>>>>> etc.
>>>>>>>> We
>>>>>>>>> are hoping that this feature can help widening the spectrum of
>>>>>> Calcite,
>>>>>>>>> solicit more use cases and adoption of Calcite.
>>>>>>>>> 
>>>>>>>>> Below is a brief description of the technical details. Please
>>> refer
>>>> to
>>>>>>>> the
>>>>>>>>> Tempura paper for more details. We are also working on a journal
>>>>>> version
>>>>>>>> of
>>>>>>>>> the paper with more implementation details.
>>>>>>>>> 
>>>>>>>>> Currently the query plan generated by Calcite is meant to be
>>>> executed
>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be
>>> extended
>>>>>> with
>>>>>>>>> temporal information so that it is capable of generating
>>> incremental
>>>>>>>> plans
>>>>>>>>> that include multiple sub-plans to execute at different time
>>> points.
>>>>>>>>> 
>>>>>>>>> The main idea is to view each table as one that changes over time
>>>>>> (Time
>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced
>>> TvrMetaSet
>>>>>> into
>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related
>>> RelSets
>>>>>> of a
>>>>>>>>> changing table (e.g. snapshot of the table at certain time,
>>> delta of
>>>>>> the
>>>>>>>>> table between two time points, etc.).
>>>>>>>>> 
>>>>>>>>> [image: image.png]
>>>>>>>>> 
>>>>>>>>> For example in the above figure, each vertical line is a
>>> TvrMetaSet
>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). Horizontal
>>>> lines
>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users can
>>>>>> write
>>>>>>>> TVR
>>>>>>>>> Rewrite Rules to describe valid transformations between these
>>> dots.
>>>>>> For
>>>>>>>>> example, the blues lines are inter-TVR rules that describe how to
>>>>>> compute
>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red lines
>>>> are
>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All
>>> TVR
>>>>>>>> rewrite
>>>>>>>>> rules are logical rules. All existing Calcite rules still work in
>>>> the
>>>>>> new
>>>>>>>>> volcano system without modification.
>>>>>>>>> 
>>>>>>>>> All changes in this feature will consist of four parts:
>>>>>>>>> 1. Memo extension with TvrMetaSet
>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and
>>> RelNodes,
>>>>>> as
>>>>>>>>> well as links in between the nodes.
>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule
>>> engine
>>>>>> API.
>>>>>>>>> 4. Multi-query optimization, used to find the best incremental
>>> plan
>>>>>>>>> involving multiple time points.
>>>>>>>>> 
>>>>>>>>> Note that this feature is an extension in nature and thus when
>>>>>> disabled,
>>>>>>>>> does not change any existing Calcite behavior.
>>>>>>>>> 
>>>>>>>>> Other than scenarios in the paper, we also applied this
>>>>>> Calcite-extended
>>>>>>>>> incremental query optimizer to a type of periodic query called
>>> the
>>>>>>>> ‘‘range
>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings of
>>> 80%
>>>>>> on
>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end execution
>>>>>> time.
>>>>>>>>> 
>>>>>>>>> All comments and suggestions are welcome. Thanks and happy
>>> holidays!
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Botong
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> ~~~~~~~~~~~~~~~
>>> no mistakes
>>> ~~~~~~~~~~~~~~~~~~
>>> 
>>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to