Re: Proposal to extend Calcite into a incremental query optimizer

Julian Hyde Thu, 28 Jan 2021 11:18:05 -0800

PS The “editable doc” that Rui refers to is also a good idea. I think we should 
create it to continue discussion after the first meeting.


Julian

> On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]> wrote:
> 
> I think good next steps would be a PR and a meeting. The PR will allow us to 
> read the code, but I think we should do the first round of questions at the 
> meeting.  The meeting could perhaps start with a presentation of the paper 
> (do you have some slides you are planning to present at VLDB, Botong?) and 
> then move on to questions about the concepts, which alternatives were 
> considered, and how the concepts map onto other current and future concepts 
> in calcite. 
> 
> I don’t think we should start “reviewing” the PR line-by-line at this point. 
> We need to understand the high-level concepts and design choices. If we start 
> reviewing the PR we will get lost in the details. 
> 
> I know that integrating a major change is hard; I doubt that we will be able 
> to integrate everything, but we can build understanding about where calcite 
> needs to go, and I hope integrate a good amount of code to help us get there.
> 
> As I said before, after the integration I would like people to be able to 
> experiment with it and use it in their production systems.  That way, it will 
> not be an experiment that withers, but a feature set integrates with other 
> calcite features and gets stronger over time. 
> 
> Julian
> 
>> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote:
>> 
>> For me to participate in the discussion for the above questions, I will
>> need to read a lot more to know relevant context and likely ask lots of
>> questions :-).  A editable doc is probably good for questions and back and
>> forward discussion.
>> 
>> 
>> -Rui
>> 
>>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> wrote:
>>> 
>>> I am also happy to help push this work into Calcite (review code and doc,
>>> etc.).
>>> 
>>> While you can share your code so people can have more idea how it is
>>> implemented, I think it would be also nice to have a doc to discuss open
>>> questions above. Some points that I copy those to here:
>>> 
>>> 1. Can this solution be compatible with existing solutions in Calcite
>>> Streaming, materialized view maintenance, and multi-query optimization
>>> (Sigma and Delta relational operators, lattice, and Spool operator),
>>> 2. Did you find that you needed two separate cost models - one for “view
>>> maintenance” and another for “user queries” - since the objectives of each
>>> activity are so different?
>>> 3. whether this work will hasten the arrival of multi-objective parametric
>>> query optimization [1] in Calcite.
>>> 4. probably SQL shell support.
>>> 
>>> 
>>> [1]:
>>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>>> 
>>> 
>>> -Rui
>>> 
>>> 
>>> 
>>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote:
>>>> 
>>>> it would be very nice to see a POC of your work.
>>>> 
>>>> 
>>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> wrote:
>>>> 
>>>>> Hi Julian,
>>>>> 
>>>>> Just wondering if there are any updates? We are wondering if it would
>>>> help
>>>>> to post our code for a quick preview.
>>>>> 
>>>>> Thanks,
>>>>> Botong
>>>>> 
>>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> wrote:
>>>>> 
>>>>>> Hi Julian,
>>>>>> 
>>>>>> Thanks for your interest! Sure let's figure out a plan that best
>>>> benefits
>>>>>> the community. Here are some clarifications that hopefully answer your
>>>>>> questions.
>>>>>> 
>>>>>> In our work (Tempura), users specify the set of time points to
>>>> consider
>>>>>> running and a cost function that expresses users' preference over
>>>> time,
>>>>>> Tempura will generate the best incremental plan that minimizes the
>>>>> overall
>>>>>> cost function.
>>>>>> 
>>>>>> In this incremental plan, the sub-plans at different time points can
>>>> be
>>>>>> different from each other, as opposed to identical plans in all delta
>>>>> runs
>>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we
>>>> can
>>>>>> mimic the current streaming implementation by specifying two (logical)
>>>>> time
>>>>>> points in Tempura, representing the initial run and later delta runs
>>>>>> respectively. In general, note that Tempura supports various form of
>>>>>> incremental computing, not only the small-delta append-only data
>>>> model in
>>>>>> streaming systems. That's why we believe Tempura subsumes the current
>>>>>> streaming support, as well as any IVM implementations.
>>>>>> 
>>>>>> About the cost model, we did not come up with a seperate cost model,
>>>> but
>>>>>> rather extended the existing one. Similar to multi-objective
>>>>> optimization,
>>>>>> costs incurred at different time points are considered different
>>>>>> dimensions. Tempura lets users supply a function that converts this
>>>> cost
>>>>>> vector into a final cost. So under this function, any two incremental
>>>>> plans
>>>>>> are still comparable and there is an overall optimum. I guess we can
>>>> go
>>>>>> down the route of multi-objective parametric query optimization
>>>> instead
>>>>> if
>>>>>> there is a need.
>>>>>> 
>>>>>> Next on materialized views and multi-query optimization, since our
>>>>>> multi-time-point plan naturally involves materializing intermediate
>>>>> results
>>>>>> for later time points, we need to solve the problem of choosing
>>>>>> materializations and include the cost of saving and reusing the
>>>>>> materializations when costing and comparing plans. We borrowed the
>>>>>> multi-query optimization techniques to solve this problem even though
>>>> we
>>>>>> are looking at a single query. As a result, we think our work is
>>>>> orthogonal
>>>>>> to Calcite's facilities around utilizing existing views, lattice etc.
>>>> We
>>>>> do
>>>>>> feel that the multi-query optimization component can be adopted to
>>>> wider
>>>>>> use, but probably need more suggestions from the community.
>>>>>> 
>>>>>> Lastly, our current implementation is set up in java code, it should
>>>> be
>>>>>> straightforward to hook it up with SQL shell.
>>>>>> 
>>>>>> Thanks,
>>>>>> Botong
>>>>>> 
>>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Botong,
>>>>>>> 
>>>>>>> This is very exciting; congratulations on this research, and thank
>>>> you
>>>>>>> for contributing it back to Calcite.
>>>>>>> 
>>>>>>> The research touches several areas in Calcite: streaming,
>>>> materialized
>>>>>>> view maintenance, and multi-query optimization. As we have already
>>>> some
>>>>>>> solutions in those areas (Sigma and Delta relational operators,
>>>> lattice,
>>>>>>> and Spool operator), it will be interesting to see whether we can
>>>> make
>>>>> them
>>>>>>> compatible, or whether one concept can subsume others.
>>>>>>> 
>>>>>>> Your work differs from streaming queries in that your relations are
>>>> used
>>>>>>> by “external” user queries, whereas in pure streaming queries, the
>>>> only
>>>>>>> activity is the change propagation. Did you find that you needed two
>>>>>>> separate cost models - one for “view maintenance” and another for
>>>> “user
>>>>>>> queries” - since the objectives of each activity are so different?
>>>>>>> 
>>>>>>> I wonder whether this work will hasten the arrival of multi-objective
>>>>>>> parametric query optimization [1] in Calcite.
>>>>>>> 
>>>>>>> I will make time over the next few days to read and digest your
>>>> paper.
>>>>>>> Then I expect that we will have a back-and-forth process to create
>>>>>>> something that will be useful for the broader community.
>>>>>>> 
>>>>>>> One thing will be particularly useful: making this functionality
>>>>>>> available from a SQL shell, so that people can experiment with this
>>>>>>> functionality without writing Java code or setting up complex
>>>> databases
>>>>> and
>>>>>>> metadata. I have in mind something like the simple DDL operations
>>>> that
>>>>> are
>>>>>>> available in Calcite’s ’server’ module. I wonder whether we could
>>>> devise
>>>>>>> some kind of SQL syntax for a “multi-query”.
>>>>>>> 
>>>>>>> Julian
>>>>>>> 
>>>>>>> [1]
>>>>>>> 
>>>>> 
>>>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]>
>>>> wrote:
>>>>>>>> 
>>>>>>>> Thanks Aron for pointing this out. To see the figure, please refer
>>>> to
>>>>>>> Fig
>>>>>>>> 3(a) in our paper:
>>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Botong
>>>>>>>> 
>>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you
>>>> open
>>>>> a
>>>>>>> JIRA
>>>>>>>>> for this, people who are interested in this can subscribe to the
>>>>> JIRA?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Regards!
>>>>>>>>> 
>>>>>>>>> Aron Tao
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道：
>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> This is a proposal to extend the Calcite optimizer into a general
>>>>>>>>>> incremental query optimizer, based on our research paper
>>>> published
>>>>> in
>>>>>>>>> VLDB
>>>>>>>>>> 2021:
>>>>>>>>>> Tempura: a general cost-based optimizer framework for incremental
>>>>> data
>>>>>>>>>> processing
>>>>>>>>>> 
>>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s
>>>> data
>>>>>>>>>> warehouse is planning to use this incremental query optimizer to
>>>>>>>>> alleviate
>>>>>>>>>> cluster-wise resource skewness:
>>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental
>>>>>>>>> Computing
>>>>>>>>>> 
>>>>>>>>>> To our best knowledge, this is the first general cost-based
>>>>>>> incremental
>>>>>>>>>> optimizer that can find the best plan across multiple families of
>>>>>>>>>> incremental computing methods, including IVM, Streaming,
>>>> DBToaster,
>>>>>>> etc.
>>>>>>>>>> Experiments (in the paper) shows that the generated best plan is
>>>>>>>>>> consistently much better than the plans from each individual
>>>> method
>>>>>>>>> alone.
>>>>>>>>>> 
>>>>>>>>>> In general, incremental query planning is central to database
>>>> view
>>>>>>>>>> maintenance and stream processing systems, and are being adopted
>>>> in
>>>>>>>>> active
>>>>>>>>>> databases, resumable query execution, approximate query
>>>> processing,
>>>>>>> etc.
>>>>>>>>> We
>>>>>>>>>> are hoping that this feature can help widening the spectrum of
>>>>>>> Calcite,
>>>>>>>>>> solicit more use cases and adoption of Calcite.
>>>>>>>>>> 
>>>>>>>>>> Below is a brief description of the technical details. Please
>>>> refer
>>>>> to
>>>>>>>>> the
>>>>>>>>>> Tempura paper for more details. We are also working on a journal
>>>>>>> version
>>>>>>>>> of
>>>>>>>>>> the paper with more implementation details.
>>>>>>>>>> 
>>>>>>>>>> Currently the query plan generated by Calcite is meant to be
>>>>> executed
>>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be
>>>> extended
>>>>>>> with
>>>>>>>>>> temporal information so that it is capable of generating
>>>> incremental
>>>>>>>>> plans
>>>>>>>>>> that include multiple sub-plans to execute at different time
>>>> points.
>>>>>>>>>> 
>>>>>>>>>> The main idea is to view each table as one that changes over time
>>>>>>> (Time
>>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced
>>>> TvrMetaSet
>>>>>>> into
>>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related
>>>> RelSets
>>>>>>> of a
>>>>>>>>>> changing table (e.g. snapshot of the table at certain time,
>>>> delta of
>>>>>>> the
>>>>>>>>>> table between two time points, etc.).
>>>>>>>>>> 
>>>>>>>>>> [image: image.png]
>>>>>>>>>> 
>>>>>>>>>> For example in the above figure, each vertical line is a
>>>> TvrMetaSet
>>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). Horizontal
>>>>> lines
>>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users can
>>>>>>> write
>>>>>>>>> TVR
>>>>>>>>>> Rewrite Rules to describe valid transformations between these
>>>> dots.
>>>>>>> For
>>>>>>>>>> example, the blues lines are inter-TVR rules that describe how to
>>>>>>> compute
>>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red lines
>>>>> are
>>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All
>>>> TVR
>>>>>>>>> rewrite
>>>>>>>>>> rules are logical rules. All existing Calcite rules still work in
>>>>> the
>>>>>>> new
>>>>>>>>>> volcano system without modification.
>>>>>>>>>> 
>>>>>>>>>> All changes in this feature will consist of four parts:
>>>>>>>>>> 1. Memo extension with TvrMetaSet
>>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and
>>>> RelNodes,
>>>>>>> as
>>>>>>>>>> well as links in between the nodes.
>>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule
>>>> engine
>>>>>>> API.
>>>>>>>>>> 4. Multi-query optimization, used to find the best incremental
>>>> plan
>>>>>>>>>> involving multiple time points.
>>>>>>>>>> 
>>>>>>>>>> Note that this feature is an extension in nature and thus when
>>>>>>> disabled,
>>>>>>>>>> does not change any existing Calcite behavior.
>>>>>>>>>> 
>>>>>>>>>> Other than scenarios in the paper, we also applied this
>>>>>>> Calcite-extended
>>>>>>>>>> incremental query optimizer to a type of periodic query called
>>>> the
>>>>>>>>> ‘‘range
>>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings of
>>>> 80%
>>>>>>> on
>>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end execution
>>>>>>> time.
>>>>>>>>>> 
>>>>>>>>>> All comments and suggestions are welcome. Thanks and happy
>>>> holidays!
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Botong
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> ~~~~~~~~~~~~~~~
>>>> no mistakes
>>>> ~~~~~~~~~~~~~~~~~~
>>>> 
>>>

Re: Proposal to extend Calcite into a incremental query optimizer

Reply via email to