I have added my time preferences to the doc. Before we meet, could you publish a PR for us to review?
Initial discussions will need to be about architecture and high-level design. So I would ask Calcite reviewers not to review the PR line-by-line (or to leave comments in GitHub) but try to understand the design holistically, and prepare questions/comments before the meeting. Botong, Can you please create a Calcite JIRA case for this task? JIRA how we track long-running tasks such as this. Julian > On Apr 3, 2021, at 5:15 PM, Botong Huang <[email protected]> wrote: > > Hi all, > > Apology for the delay. It took us some time to clean up our code base and > publicly release it (which will be out soon) for a quick peek. > > We are ready to present our work. Let's schedule a time for a Zoom > meeting and discuss how to integrate Tempura into Calcite. > > Since some of our team members are in China, we prefer the time slot of > 7:00pm-11:30pm PST any day. I've added our time preference in the shared > doc below. > https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing > > We encourage everyone to add their time preferences (during 04/15-04/30) in > this doc. In a week or so, we will try to settle a time that works for > most. > > Thanks, > Botong > > On Sat, Jan 30, 2021 at 9:19 PM Botong Huang <[email protected]> wrote: > >> Hi Julian and Rui, >> >> Sounds good to us. Please give us some time to prepare some slides for the >> meeting. >> >> I've created a doc below for discussion. Please feel free to add more in >> here: >> >> https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing >> >> Thanks, >> Botong >> >> On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]> >> wrote: >> >>> PS The “editable doc” that Rui refers to is also a good idea. I think we >>> should create it to continue discussion after the first meeting. >>> >>> Julian >>> >>>> On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]> >>> wrote: >>>> >>>> I think good next steps would be a PR and a meeting. The PR will allow >>> us to read the code, but I think we should do the first round of questions >>> at the meeting. The meeting could perhaps start with a presentation of the >>> paper (do you have some slides you are planning to present at VLDB, >>> Botong?) and then move on to questions about the concepts, which >>> alternatives were considered, and how the concepts map onto other current >>> and future concepts in calcite. >>>> >>>> I don’t think we should start “reviewing” the PR line-by-line at this >>> point. We need to understand the high-level concepts and design choices. If >>> we start reviewing the PR we will get lost in the details. >>>> >>>> I know that integrating a major change is hard; I doubt that we will be >>> able to integrate everything, but we can build understanding about where >>> calcite needs to go, and I hope integrate a good amount of code to help us >>> get there. >>>> >>>> As I said before, after the integration I would like people to be able >>> to experiment with it and use it in their production systems. That way, it >>> will not be an experiment that withers, but a feature set integrates with >>> other calcite features and gets stronger over time. >>>> >>>> Julian >>>> >>>>> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote: >>>>> >>>>> For me to participate in the discussion for the above questions, I >>> will >>>>> need to read a lot more to know relevant context and likely ask lots of >>>>> questions :-). A editable doc is probably good for questions and back >>> and >>>>> forward discussion. >>>>> >>>>> >>>>> -Rui >>>>> >>>>>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> >>> wrote: >>>>>> >>>>>> I am also happy to help push this work into Calcite (review code and >>> doc, >>>>>> etc.). >>>>>> >>>>>> While you can share your code so people can have more idea how it is >>>>>> implemented, I think it would be also nice to have a doc to discuss >>> open >>>>>> questions above. Some points that I copy those to here: >>>>>> >>>>>> 1. Can this solution be compatible with existing solutions in Calcite >>>>>> Streaming, materialized view maintenance, and multi-query optimization >>>>>> (Sigma and Delta relational operators, lattice, and Spool operator), >>>>>> 2. Did you find that you needed two separate cost models - one for >>> “view >>>>>> maintenance” and another for “user queries” - since the objectives of >>> each >>>>>> activity are so different? >>>>>> 3. whether this work will hasten the arrival of multi-objective >>> parametric >>>>>> query optimization [1] in Calcite. >>>>>> 4. probably SQL shell support. >>>>>> >>>>>> >>>>>> [1]: >>>>>> >>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >>>>>> >>>>>> >>>>>> -Rui >>>>>> >>>>>> >>>>>> >>>>>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote: >>>>>>> >>>>>>> it would be very nice to see a POC of your work. >>>>>>> >>>>>>> >>>>>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> >>> wrote: >>>>>>> >>>>>>>> Hi Julian, >>>>>>>> >>>>>>>> Just wondering if there are any updates? We are wondering if it >>> would >>>>>>> help >>>>>>>> to post our code for a quick preview. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Botong >>>>>>>> >>>>>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> >>> wrote: >>>>>>>> >>>>>>>>> Hi Julian, >>>>>>>>> >>>>>>>>> Thanks for your interest! Sure let's figure out a plan that best >>>>>>> benefits >>>>>>>>> the community. Here are some clarifications that hopefully answer >>> your >>>>>>>>> questions. >>>>>>>>> >>>>>>>>> In our work (Tempura), users specify the set of time points to >>>>>>> consider >>>>>>>>> running and a cost function that expresses users' preference over >>>>>>> time, >>>>>>>>> Tempura will generate the best incremental plan that minimizes the >>>>>>>> overall >>>>>>>>> cost function. >>>>>>>>> >>>>>>>>> In this incremental plan, the sub-plans at different time points >>> can >>>>>>> be >>>>>>>>> different from each other, as opposed to identical plans in all >>> delta >>>>>>>> runs >>>>>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, >>> we >>>>>>> can >>>>>>>>> mimic the current streaming implementation by specifying two >>> (logical) >>>>>>>> time >>>>>>>>> points in Tempura, representing the initial run and later delta >>> runs >>>>>>>>> respectively. In general, note that Tempura supports various form >>> of >>>>>>>>> incremental computing, not only the small-delta append-only data >>>>>>> model in >>>>>>>>> streaming systems. That's why we believe Tempura subsumes the >>> current >>>>>>>>> streaming support, as well as any IVM implementations. >>>>>>>>> >>>>>>>>> About the cost model, we did not come up with a seperate cost >>> model, >>>>>>> but >>>>>>>>> rather extended the existing one. Similar to multi-objective >>>>>>>> optimization, >>>>>>>>> costs incurred at different time points are considered different >>>>>>>>> dimensions. Tempura lets users supply a function that converts this >>>>>>> cost >>>>>>>>> vector into a final cost. So under this function, any two >>> incremental >>>>>>>> plans >>>>>>>>> are still comparable and there is an overall optimum. I guess we >>> can >>>>>>> go >>>>>>>>> down the route of multi-objective parametric query optimization >>>>>>> instead >>>>>>>> if >>>>>>>>> there is a need. >>>>>>>>> >>>>>>>>> Next on materialized views and multi-query optimization, since our >>>>>>>>> multi-time-point plan naturally involves materializing intermediate >>>>>>>> results >>>>>>>>> for later time points, we need to solve the problem of choosing >>>>>>>>> materializations and include the cost of saving and reusing the >>>>>>>>> materializations when costing and comparing plans. We borrowed the >>>>>>>>> multi-query optimization techniques to solve this problem even >>> though >>>>>>> we >>>>>>>>> are looking at a single query. As a result, we think our work is >>>>>>>> orthogonal >>>>>>>>> to Calcite's facilities around utilizing existing views, lattice >>> etc. >>>>>>> We >>>>>>>> do >>>>>>>>> feel that the multi-query optimization component can be adopted to >>>>>>> wider >>>>>>>>> use, but probably need more suggestions from the community. >>>>>>>>> >>>>>>>>> Lastly, our current implementation is set up in java code, it >>> should >>>>>>> be >>>>>>>>> straightforward to hook it up with SQL shell. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Botong >>>>>>>>> >>>>>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde < >>> [email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Botong, >>>>>>>>>> >>>>>>>>>> This is very exciting; congratulations on this research, and thank >>>>>>> you >>>>>>>>>> for contributing it back to Calcite. >>>>>>>>>> >>>>>>>>>> The research touches several areas in Calcite: streaming, >>>>>>> materialized >>>>>>>>>> view maintenance, and multi-query optimization. As we have already >>>>>>> some >>>>>>>>>> solutions in those areas (Sigma and Delta relational operators, >>>>>>> lattice, >>>>>>>>>> and Spool operator), it will be interesting to see whether we can >>>>>>> make >>>>>>>> them >>>>>>>>>> compatible, or whether one concept can subsume others. >>>>>>>>>> >>>>>>>>>> Your work differs from streaming queries in that your relations >>> are >>>>>>> used >>>>>>>>>> by “external” user queries, whereas in pure streaming queries, the >>>>>>> only >>>>>>>>>> activity is the change propagation. Did you find that you needed >>> two >>>>>>>>>> separate cost models - one for “view maintenance” and another for >>>>>>> “user >>>>>>>>>> queries” - since the objectives of each activity are so different? >>>>>>>>>> >>>>>>>>>> I wonder whether this work will hasten the arrival of >>> multi-objective >>>>>>>>>> parametric query optimization [1] in Calcite. >>>>>>>>>> >>>>>>>>>> I will make time over the next few days to read and digest your >>>>>>> paper. >>>>>>>>>> Then I expect that we will have a back-and-forth process to create >>>>>>>>>> something that will be useful for the broader community. >>>>>>>>>> >>>>>>>>>> One thing will be particularly useful: making this functionality >>>>>>>>>> available from a SQL shell, so that people can experiment with >>> this >>>>>>>>>> functionality without writing Java code or setting up complex >>>>>>> databases >>>>>>>> and >>>>>>>>>> metadata. I have in mind something like the simple DDL operations >>>>>>> that >>>>>>>> are >>>>>>>>>> available in Calcite’s ’server’ module. I wonder whether we could >>>>>>> devise >>>>>>>>>> some kind of SQL syntax for a “multi-query”. >>>>>>>>>> >>>>>>>>>> Julian >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> >>>>>>>> >>>>>>> >>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> >>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Thanks Aron for pointing this out. To see the figure, please >>> refer >>>>>>> to >>>>>>>>>> Fig >>>>>>>>>>> 3(a) in our paper: >>>>>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Botong >>>>>>>>>>> >>>>>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]> >>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you >>>>>>> open >>>>>>>> a >>>>>>>>>> JIRA >>>>>>>>>>>> for this, people who are interested in this can subscribe to the >>>>>>>> JIRA? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Regards! >>>>>>>>>>>> >>>>>>>>>>>> Aron Tao >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道: >>>>>>>>>>>> >>>>>>>>>>>>> Hi all, >>>>>>>>>>>>> >>>>>>>>>>>>> This is a proposal to extend the Calcite optimizer into a >>> general >>>>>>>>>>>>> incremental query optimizer, based on our research paper >>>>>>> published >>>>>>>> in >>>>>>>>>>>> VLDB >>>>>>>>>>>>> 2021: >>>>>>>>>>>>> Tempura: a general cost-based optimizer framework for >>> incremental >>>>>>>> data >>>>>>>>>>>>> processing >>>>>>>>>>>>> >>>>>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s >>>>>>> data >>>>>>>>>>>>> warehouse is planning to use this incremental query optimizer >>> to >>>>>>>>>>>> alleviate >>>>>>>>>>>>> cluster-wise resource skewness: >>>>>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware >>> Incremental >>>>>>>>>>>> Computing >>>>>>>>>>>>> >>>>>>>>>>>>> To our best knowledge, this is the first general cost-based >>>>>>>>>> incremental >>>>>>>>>>>>> optimizer that can find the best plan across multiple families >>> of >>>>>>>>>>>>> incremental computing methods, including IVM, Streaming, >>>>>>> DBToaster, >>>>>>>>>> etc. >>>>>>>>>>>>> Experiments (in the paper) shows that the generated best plan >>> is >>>>>>>>>>>>> consistently much better than the plans from each individual >>>>>>> method >>>>>>>>>>>> alone. >>>>>>>>>>>>> >>>>>>>>>>>>> In general, incremental query planning is central to database >>>>>>> view >>>>>>>>>>>>> maintenance and stream processing systems, and are being >>> adopted >>>>>>> in >>>>>>>>>>>> active >>>>>>>>>>>>> databases, resumable query execution, approximate query >>>>>>> processing, >>>>>>>>>> etc. >>>>>>>>>>>> We >>>>>>>>>>>>> are hoping that this feature can help widening the spectrum of >>>>>>>>>> Calcite, >>>>>>>>>>>>> solicit more use cases and adoption of Calcite. >>>>>>>>>>>>> >>>>>>>>>>>>> Below is a brief description of the technical details. Please >>>>>>> refer >>>>>>>> to >>>>>>>>>>>> the >>>>>>>>>>>>> Tempura paper for more details. We are also working on a >>> journal >>>>>>>>>> version >>>>>>>>>>>> of >>>>>>>>>>>>> the paper with more implementation details. >>>>>>>>>>>>> >>>>>>>>>>>>> Currently the query plan generated by Calcite is meant to be >>>>>>>> executed >>>>>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be >>>>>>> extended >>>>>>>>>> with >>>>>>>>>>>>> temporal information so that it is capable of generating >>>>>>> incremental >>>>>>>>>>>> plans >>>>>>>>>>>>> that include multiple sub-plans to execute at different time >>>>>>> points. >>>>>>>>>>>>> >>>>>>>>>>>>> The main idea is to view each table as one that changes over >>> time >>>>>>>>>> (Time >>>>>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced >>>>>>> TvrMetaSet >>>>>>>>>> into >>>>>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related >>>>>>> RelSets >>>>>>>>>> of a >>>>>>>>>>>>> changing table (e.g. snapshot of the table at certain time, >>>>>>> delta of >>>>>>>>>> the >>>>>>>>>>>>> table between two time points, etc.). >>>>>>>>>>>>> >>>>>>>>>>>>> [image: image.png] >>>>>>>>>>>>> >>>>>>>>>>>>> For example in the above figure, each vertical line is a >>>>>>> TvrMetaSet >>>>>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). >>> Horizontal >>>>>>>> lines >>>>>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users >>> can >>>>>>>>>> write >>>>>>>>>>>> TVR >>>>>>>>>>>>> Rewrite Rules to describe valid transformations between these >>>>>>> dots. >>>>>>>>>> For >>>>>>>>>>>>> example, the blues lines are inter-TVR rules that describe how >>> to >>>>>>>>>> compute >>>>>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red >>> lines >>>>>>>> are >>>>>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All >>>>>>> TVR >>>>>>>>>>>> rewrite >>>>>>>>>>>>> rules are logical rules. All existing Calcite rules still work >>> in >>>>>>>> the >>>>>>>>>> new >>>>>>>>>>>>> volcano system without modification. >>>>>>>>>>>>> >>>>>>>>>>>>> All changes in this feature will consist of four parts: >>>>>>>>>>>>> 1. Memo extension with TvrMetaSet >>>>>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and >>>>>>> RelNodes, >>>>>>>>>> as >>>>>>>>>>>>> well as links in between the nodes. >>>>>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule >>>>>>> engine >>>>>>>>>> API. >>>>>>>>>>>>> 4. Multi-query optimization, used to find the best incremental >>>>>>> plan >>>>>>>>>>>>> involving multiple time points. >>>>>>>>>>>>> >>>>>>>>>>>>> Note that this feature is an extension in nature and thus when >>>>>>>>>> disabled, >>>>>>>>>>>>> does not change any existing Calcite behavior. >>>>>>>>>>>>> >>>>>>>>>>>>> Other than scenarios in the paper, we also applied this >>>>>>>>>> Calcite-extended >>>>>>>>>>>>> incremental query optimizer to a type of periodic query called >>>>>>> the >>>>>>>>>>>> ‘‘range >>>>>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings >>> of >>>>>>> 80% >>>>>>>>>> on >>>>>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end >>> execution >>>>>>>>>> time. >>>>>>>>>>>>> >>>>>>>>>>>>> All comments and suggestions are welcome. Thanks and happy >>>>>>> holidays! >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Botong >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> ~~~~~~~~~~~~~~~ >>>>>>> no mistakes >>>>>>> ~~~~~~~~~~~~~~~~~~ >>>>>>> >>>>>> >>> >>
