Hi all, Apology for the delay. It took us some time to clean up our code base and publicly release it (which will be out soon) for a quick peek.
We are ready to present our work. Let's schedule a time for a Zoom meeting and discuss how to integrate Tempura into Calcite. Since some of our team members are in China, we prefer the time slot of 7:00pm-11:30pm PST any day. I've added our time preference in the shared doc below. https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing We encourage everyone to add their time preferences (during 04/15-04/30) in this doc. In a week or so, we will try to settle a time that works for most. Thanks, Botong On Sat, Jan 30, 2021 at 9:19 PM Botong Huang <[email protected]> wrote: > Hi Julian and Rui, > > Sounds good to us. Please give us some time to prepare some slides for the > meeting. > > I've created a doc below for discussion. Please feel free to add more in > here: > > https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing > > Thanks, > Botong > > On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]> > wrote: > >> PS The “editable doc” that Rui refers to is also a good idea. I think we >> should create it to continue discussion after the first meeting. >> >> Julian >> >> > On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]> >> wrote: >> > >> > I think good next steps would be a PR and a meeting. The PR will allow >> us to read the code, but I think we should do the first round of questions >> at the meeting. The meeting could perhaps start with a presentation of the >> paper (do you have some slides you are planning to present at VLDB, >> Botong?) and then move on to questions about the concepts, which >> alternatives were considered, and how the concepts map onto other current >> and future concepts in calcite. >> > >> > I don’t think we should start “reviewing” the PR line-by-line at this >> point. We need to understand the high-level concepts and design choices. If >> we start reviewing the PR we will get lost in the details. >> > >> > I know that integrating a major change is hard; I doubt that we will be >> able to integrate everything, but we can build understanding about where >> calcite needs to go, and I hope integrate a good amount of code to help us >> get there. >> > >> > As I said before, after the integration I would like people to be able >> to experiment with it and use it in their production systems. That way, it >> will not be an experiment that withers, but a feature set integrates with >> other calcite features and gets stronger over time. >> > >> > Julian >> > >> >> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote: >> >> >> >> For me to participate in the discussion for the above questions, I >> will >> >> need to read a lot more to know relevant context and likely ask lots of >> >> questions :-). A editable doc is probably good for questions and back >> and >> >> forward discussion. >> >> >> >> >> >> -Rui >> >> >> >>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> >> wrote: >> >>> >> >>> I am also happy to help push this work into Calcite (review code and >> doc, >> >>> etc.). >> >>> >> >>> While you can share your code so people can have more idea how it is >> >>> implemented, I think it would be also nice to have a doc to discuss >> open >> >>> questions above. Some points that I copy those to here: >> >>> >> >>> 1. Can this solution be compatible with existing solutions in Calcite >> >>> Streaming, materialized view maintenance, and multi-query optimization >> >>> (Sigma and Delta relational operators, lattice, and Spool operator), >> >>> 2. Did you find that you needed two separate cost models - one for >> “view >> >>> maintenance” and another for “user queries” - since the objectives of >> each >> >>> activity are so different? >> >>> 3. whether this work will hasten the arrival of multi-objective >> parametric >> >>> query optimization [1] in Calcite. >> >>> 4. probably SQL shell support. >> >>> >> >>> >> >>> [1]: >> >>> >> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >> >>> >> >>> >> >>> -Rui >> >>> >> >>> >> >>> >> >>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote: >> >>>> >> >>>> it would be very nice to see a POC of your work. >> >>>> >> >>>> >> >>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> >> wrote: >> >>>> >> >>>>> Hi Julian, >> >>>>> >> >>>>> Just wondering if there are any updates? We are wondering if it >> would >> >>>> help >> >>>>> to post our code for a quick preview. >> >>>>> >> >>>>> Thanks, >> >>>>> Botong >> >>>>> >> >>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> >> wrote: >> >>>>> >> >>>>>> Hi Julian, >> >>>>>> >> >>>>>> Thanks for your interest! Sure let's figure out a plan that best >> >>>> benefits >> >>>>>> the community. Here are some clarifications that hopefully answer >> your >> >>>>>> questions. >> >>>>>> >> >>>>>> In our work (Tempura), users specify the set of time points to >> >>>> consider >> >>>>>> running and a cost function that expresses users' preference over >> >>>> time, >> >>>>>> Tempura will generate the best incremental plan that minimizes the >> >>>>> overall >> >>>>>> cost function. >> >>>>>> >> >>>>>> In this incremental plan, the sub-plans at different time points >> can >> >>>> be >> >>>>>> different from each other, as opposed to identical plans in all >> delta >> >>>>> runs >> >>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, >> we >> >>>> can >> >>>>>> mimic the current streaming implementation by specifying two >> (logical) >> >>>>> time >> >>>>>> points in Tempura, representing the initial run and later delta >> runs >> >>>>>> respectively. In general, note that Tempura supports various form >> of >> >>>>>> incremental computing, not only the small-delta append-only data >> >>>> model in >> >>>>>> streaming systems. That's why we believe Tempura subsumes the >> current >> >>>>>> streaming support, as well as any IVM implementations. >> >>>>>> >> >>>>>> About the cost model, we did not come up with a seperate cost >> model, >> >>>> but >> >>>>>> rather extended the existing one. Similar to multi-objective >> >>>>> optimization, >> >>>>>> costs incurred at different time points are considered different >> >>>>>> dimensions. Tempura lets users supply a function that converts this >> >>>> cost >> >>>>>> vector into a final cost. So under this function, any two >> incremental >> >>>>> plans >> >>>>>> are still comparable and there is an overall optimum. I guess we >> can >> >>>> go >> >>>>>> down the route of multi-objective parametric query optimization >> >>>> instead >> >>>>> if >> >>>>>> there is a need. >> >>>>>> >> >>>>>> Next on materialized views and multi-query optimization, since our >> >>>>>> multi-time-point plan naturally involves materializing intermediate >> >>>>> results >> >>>>>> for later time points, we need to solve the problem of choosing >> >>>>>> materializations and include the cost of saving and reusing the >> >>>>>> materializations when costing and comparing plans. We borrowed the >> >>>>>> multi-query optimization techniques to solve this problem even >> though >> >>>> we >> >>>>>> are looking at a single query. As a result, we think our work is >> >>>>> orthogonal >> >>>>>> to Calcite's facilities around utilizing existing views, lattice >> etc. >> >>>> We >> >>>>> do >> >>>>>> feel that the multi-query optimization component can be adopted to >> >>>> wider >> >>>>>> use, but probably need more suggestions from the community. >> >>>>>> >> >>>>>> Lastly, our current implementation is set up in java code, it >> should >> >>>> be >> >>>>>> straightforward to hook it up with SQL shell. >> >>>>>> >> >>>>>> Thanks, >> >>>>>> Botong >> >>>>>> >> >>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde < >> [email protected]> >> >>>>>> wrote: >> >>>>>> >> >>>>>>> Botong, >> >>>>>>> >> >>>>>>> This is very exciting; congratulations on this research, and thank >> >>>> you >> >>>>>>> for contributing it back to Calcite. >> >>>>>>> >> >>>>>>> The research touches several areas in Calcite: streaming, >> >>>> materialized >> >>>>>>> view maintenance, and multi-query optimization. As we have already >> >>>> some >> >>>>>>> solutions in those areas (Sigma and Delta relational operators, >> >>>> lattice, >> >>>>>>> and Spool operator), it will be interesting to see whether we can >> >>>> make >> >>>>> them >> >>>>>>> compatible, or whether one concept can subsume others. >> >>>>>>> >> >>>>>>> Your work differs from streaming queries in that your relations >> are >> >>>> used >> >>>>>>> by “external” user queries, whereas in pure streaming queries, the >> >>>> only >> >>>>>>> activity is the change propagation. Did you find that you needed >> two >> >>>>>>> separate cost models - one for “view maintenance” and another for >> >>>> “user >> >>>>>>> queries” - since the objectives of each activity are so different? >> >>>>>>> >> >>>>>>> I wonder whether this work will hasten the arrival of >> multi-objective >> >>>>>>> parametric query optimization [1] in Calcite. >> >>>>>>> >> >>>>>>> I will make time over the next few days to read and digest your >> >>>> paper. >> >>>>>>> Then I expect that we will have a back-and-forth process to create >> >>>>>>> something that will be useful for the broader community. >> >>>>>>> >> >>>>>>> One thing will be particularly useful: making this functionality >> >>>>>>> available from a SQL shell, so that people can experiment with >> this >> >>>>>>> functionality without writing Java code or setting up complex >> >>>> databases >> >>>>> and >> >>>>>>> metadata. I have in mind something like the simple DDL operations >> >>>> that >> >>>>> are >> >>>>>>> available in Calcite’s ’server’ module. I wonder whether we could >> >>>> devise >> >>>>>>> some kind of SQL syntax for a “multi-query”. >> >>>>>>> >> >>>>>>> Julian >> >>>>>>> >> >>>>>>> [1] >> >>>>>>> >> >>>>> >> >>>> >> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> >> >>>> wrote: >> >>>>>>>> >> >>>>>>>> Thanks Aron for pointing this out. To see the figure, please >> refer >> >>>> to >> >>>>>>> Fig >> >>>>>>>> 3(a) in our paper: >> >>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf >> >>>>>>>> >> >>>>>>>> Best, >> >>>>>>>> Botong >> >>>>>>>> >> >>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]> >> >>>>> wrote: >> >>>>>>>> >> >>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you >> >>>> open >> >>>>> a >> >>>>>>> JIRA >> >>>>>>>>> for this, people who are interested in this can subscribe to the >> >>>>> JIRA? >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Regards! >> >>>>>>>>> >> >>>>>>>>> Aron Tao >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道: >> >>>>>>>>> >> >>>>>>>>>> Hi all, >> >>>>>>>>>> >> >>>>>>>>>> This is a proposal to extend the Calcite optimizer into a >> general >> >>>>>>>>>> incremental query optimizer, based on our research paper >> >>>> published >> >>>>> in >> >>>>>>>>> VLDB >> >>>>>>>>>> 2021: >> >>>>>>>>>> Tempura: a general cost-based optimizer framework for >> incremental >> >>>>> data >> >>>>>>>>>> processing >> >>>>>>>>>> >> >>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s >> >>>> data >> >>>>>>>>>> warehouse is planning to use this incremental query optimizer >> to >> >>>>>>>>> alleviate >> >>>>>>>>>> cluster-wise resource skewness: >> >>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware >> Incremental >> >>>>>>>>> Computing >> >>>>>>>>>> >> >>>>>>>>>> To our best knowledge, this is the first general cost-based >> >>>>>>> incremental >> >>>>>>>>>> optimizer that can find the best plan across multiple families >> of >> >>>>>>>>>> incremental computing methods, including IVM, Streaming, >> >>>> DBToaster, >> >>>>>>> etc. >> >>>>>>>>>> Experiments (in the paper) shows that the generated best plan >> is >> >>>>>>>>>> consistently much better than the plans from each individual >> >>>> method >> >>>>>>>>> alone. >> >>>>>>>>>> >> >>>>>>>>>> In general, incremental query planning is central to database >> >>>> view >> >>>>>>>>>> maintenance and stream processing systems, and are being >> adopted >> >>>> in >> >>>>>>>>> active >> >>>>>>>>>> databases, resumable query execution, approximate query >> >>>> processing, >> >>>>>>> etc. >> >>>>>>>>> We >> >>>>>>>>>> are hoping that this feature can help widening the spectrum of >> >>>>>>> Calcite, >> >>>>>>>>>> solicit more use cases and adoption of Calcite. >> >>>>>>>>>> >> >>>>>>>>>> Below is a brief description of the technical details. Please >> >>>> refer >> >>>>> to >> >>>>>>>>> the >> >>>>>>>>>> Tempura paper for more details. We are also working on a >> journal >> >>>>>>> version >> >>>>>>>>> of >> >>>>>>>>>> the paper with more implementation details. >> >>>>>>>>>> >> >>>>>>>>>> Currently the query plan generated by Calcite is meant to be >> >>>>> executed >> >>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be >> >>>> extended >> >>>>>>> with >> >>>>>>>>>> temporal information so that it is capable of generating >> >>>> incremental >> >>>>>>>>> plans >> >>>>>>>>>> that include multiple sub-plans to execute at different time >> >>>> points. >> >>>>>>>>>> >> >>>>>>>>>> The main idea is to view each table as one that changes over >> time >> >>>>>>> (Time >> >>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced >> >>>> TvrMetaSet >> >>>>>>> into >> >>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related >> >>>> RelSets >> >>>>>>> of a >> >>>>>>>>>> changing table (e.g. snapshot of the table at certain time, >> >>>> delta of >> >>>>>>> the >> >>>>>>>>>> table between two time points, etc.). >> >>>>>>>>>> >> >>>>>>>>>> [image: image.png] >> >>>>>>>>>> >> >>>>>>>>>> For example in the above figure, each vertical line is a >> >>>> TvrMetaSet >> >>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). >> Horizontal >> >>>>> lines >> >>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users >> can >> >>>>>>> write >> >>>>>>>>> TVR >> >>>>>>>>>> Rewrite Rules to describe valid transformations between these >> >>>> dots. >> >>>>>>> For >> >>>>>>>>>> example, the blues lines are inter-TVR rules that describe how >> to >> >>>>>>> compute >> >>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red >> lines >> >>>>> are >> >>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All >> >>>> TVR >> >>>>>>>>> rewrite >> >>>>>>>>>> rules are logical rules. All existing Calcite rules still work >> in >> >>>>> the >> >>>>>>> new >> >>>>>>>>>> volcano system without modification. >> >>>>>>>>>> >> >>>>>>>>>> All changes in this feature will consist of four parts: >> >>>>>>>>>> 1. Memo extension with TvrMetaSet >> >>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and >> >>>> RelNodes, >> >>>>>>> as >> >>>>>>>>>> well as links in between the nodes. >> >>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule >> >>>> engine >> >>>>>>> API. >> >>>>>>>>>> 4. Multi-query optimization, used to find the best incremental >> >>>> plan >> >>>>>>>>>> involving multiple time points. >> >>>>>>>>>> >> >>>>>>>>>> Note that this feature is an extension in nature and thus when >> >>>>>>> disabled, >> >>>>>>>>>> does not change any existing Calcite behavior. >> >>>>>>>>>> >> >>>>>>>>>> Other than scenarios in the paper, we also applied this >> >>>>>>> Calcite-extended >> >>>>>>>>>> incremental query optimizer to a type of periodic query called >> >>>> the >> >>>>>>>>> ‘‘range >> >>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings >> of >> >>>> 80% >> >>>>>>> on >> >>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end >> execution >> >>>>>>> time. >> >>>>>>>>>> >> >>>>>>>>>> All comments and suggestions are welcome. Thanks and happy >> >>>> holidays! >> >>>>>>>>>> >> >>>>>>>>>> Best, >> >>>>>>>>>> Botong >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> >> >>>> >> >>>> >> >>>> -- >> >>>> ~~~~~~~~~~~~~~~ >> >>>> no mistakes >> >>>> ~~~~~~~~~~~~~~~~~~ >> >>>> >> >>> >> >
