Hi all, According to the preferences collected, we are tentatively scheduling our meeting at 9pm-10:30pm PST on 04/26 Monday.
We will give a presentation about Tempura, followed by a free discussion. Please let us know if there are new other requests. Few days before the meeting, I will send out a zoom meeting link. Thanks, Botong On Wed, Apr 7, 2021 at 2:46 PM Botong Huang <[email protected]> wrote: > Hi Julian and all, > > We've posted the Tempura code base below. Feel free to take a quick peek > at the last five commits. > https://github.com/alibaba/cost-based-incremental-optimizer/commits/main > > I've also opened a Jira (CALCITE-4568 > <https://issues.apache.org/jira/browse/CALCITE-4568>), which will serve > as the umbrella Jira for the feature. > > In the meantime, we encourage everyone to enter the time preferences for > our first meeting here: > > https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing > > Thanks, > Botong > > On Mon, Apr 5, 2021 at 3:59 PM Julian Hyde <[email protected]> wrote: > >> I have added my time preferences to the doc. >> >> Before we meet, could you publish a PR for us to review? >> >> Initial discussions will need to be about architecture and high-level >> design. So I would ask Calcite reviewers not to review the PR line-by-line >> (or to leave comments in GitHub) but try to understand the design >> holistically, and prepare questions/comments before the meeting. >> >> Botong, Can you please create a Calcite JIRA case for this task? JIRA how >> we track long-running tasks such as this. >> >> Julian >> >> >> > On Apr 3, 2021, at 5:15 PM, Botong Huang <[email protected]> wrote: >> > >> > Hi all, >> > >> > Apology for the delay. It took us some time to clean up our code base >> and >> > publicly release it (which will be out soon) for a quick peek. >> > >> > We are ready to present our work. Let's schedule a time for a Zoom >> > meeting and discuss how to integrate Tempura into Calcite. >> > >> > Since some of our team members are in China, we prefer the time slot of >> > 7:00pm-11:30pm PST any day. I've added our time preference in the shared >> > doc below. >> > >> https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing >> > >> > We encourage everyone to add their time preferences (during >> 04/15-04/30) in >> > this doc. In a week or so, we will try to settle a time that works for >> > most. >> > >> > Thanks, >> > Botong >> > >> > On Sat, Jan 30, 2021 at 9:19 PM Botong Huang <[email protected]> wrote: >> > >> >> Hi Julian and Rui, >> >> >> >> Sounds good to us. Please give us some time to prepare some slides for >> the >> >> meeting. >> >> >> >> I've created a doc below for discussion. Please feel free to add more >> in >> >> here: >> >> >> >> >> https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing >> >> >> >> Thanks, >> >> Botong >> >> >> >> On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]> >> >> wrote: >> >> >> >>> PS The “editable doc” that Rui refers to is also a good idea. I think >> we >> >>> should create it to continue discussion after the first meeting. >> >>> >> >>> Julian >> >>> >> >>>> On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]> >> >>> wrote: >> >>>> >> >>>> I think good next steps would be a PR and a meeting. The PR will >> allow >> >>> us to read the code, but I think we should do the first round of >> questions >> >>> at the meeting. The meeting could perhaps start with a presentation >> of the >> >>> paper (do you have some slides you are planning to present at VLDB, >> >>> Botong?) and then move on to questions about the concepts, which >> >>> alternatives were considered, and how the concepts map onto other >> current >> >>> and future concepts in calcite. >> >>>> >> >>>> I don’t think we should start “reviewing” the PR line-by-line at this >> >>> point. We need to understand the high-level concepts and design >> choices. If >> >>> we start reviewing the PR we will get lost in the details. >> >>>> >> >>>> I know that integrating a major change is hard; I doubt that we will >> be >> >>> able to integrate everything, but we can build understanding about >> where >> >>> calcite needs to go, and I hope integrate a good amount of code to >> help us >> >>> get there. >> >>>> >> >>>> As I said before, after the integration I would like people to be >> able >> >>> to experiment with it and use it in their production systems. That >> way, it >> >>> will not be an experiment that withers, but a feature set integrates >> with >> >>> other calcite features and gets stronger over time. >> >>>> >> >>>> Julian >> >>>> >> >>>>> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> >> wrote: >> >>>>> >> >>>>> For me to participate in the discussion for the above questions, I >> >>> will >> >>>>> need to read a lot more to know relevant context and likely ask >> lots of >> >>>>> questions :-). A editable doc is probably good for questions and >> back >> >>> and >> >>>>> forward discussion. >> >>>>> >> >>>>> >> >>>>> -Rui >> >>>>> >> >>>>>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> >> >>> wrote: >> >>>>>> >> >>>>>> I am also happy to help push this work into Calcite (review code >> and >> >>> doc, >> >>>>>> etc.). >> >>>>>> >> >>>>>> While you can share your code so people can have more idea how it >> is >> >>>>>> implemented, I think it would be also nice to have a doc to discuss >> >>> open >> >>>>>> questions above. Some points that I copy those to here: >> >>>>>> >> >>>>>> 1. Can this solution be compatible with existing solutions in >> Calcite >> >>>>>> Streaming, materialized view maintenance, and multi-query >> optimization >> >>>>>> (Sigma and Delta relational operators, lattice, and Spool >> operator), >> >>>>>> 2. Did you find that you needed two separate cost models - one for >> >>> “view >> >>>>>> maintenance” and another for “user queries” - since the objectives >> of >> >>> each >> >>>>>> activity are so different? >> >>>>>> 3. whether this work will hasten the arrival of multi-objective >> >>> parametric >> >>>>>> query optimization [1] in Calcite. >> >>>>>> 4. probably SQL shell support. >> >>>>>> >> >>>>>> >> >>>>>> [1]: >> >>>>>> >> >>> >> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >> >>>>>> >> >>>>>> >> >>>>>> -Rui >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> >> wrote: >> >>>>>>> >> >>>>>>> it would be very nice to see a POC of your work. >> >>>>>>> >> >>>>>>> >> >>>>>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> >> >>> wrote: >> >>>>>>> >> >>>>>>>> Hi Julian, >> >>>>>>>> >> >>>>>>>> Just wondering if there are any updates? We are wondering if it >> >>> would >> >>>>>>> help >> >>>>>>>> to post our code for a quick preview. >> >>>>>>>> >> >>>>>>>> Thanks, >> >>>>>>>> Botong >> >>>>>>>> >> >>>>>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> >> >>> wrote: >> >>>>>>>> >> >>>>>>>>> Hi Julian, >> >>>>>>>>> >> >>>>>>>>> Thanks for your interest! Sure let's figure out a plan that best >> >>>>>>> benefits >> >>>>>>>>> the community. Here are some clarifications that hopefully >> answer >> >>> your >> >>>>>>>>> questions. >> >>>>>>>>> >> >>>>>>>>> In our work (Tempura), users specify the set of time points to >> >>>>>>> consider >> >>>>>>>>> running and a cost function that expresses users' preference >> over >> >>>>>>> time, >> >>>>>>>>> Tempura will generate the best incremental plan that minimizes >> the >> >>>>>>>> overall >> >>>>>>>>> cost function. >> >>>>>>>>> >> >>>>>>>>> In this incremental plan, the sub-plans at different time points >> >>> can >> >>>>>>> be >> >>>>>>>>> different from each other, as opposed to identical plans in all >> >>> delta >> >>>>>>>> runs >> >>>>>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura >> paper, >> >>> we >> >>>>>>> can >> >>>>>>>>> mimic the current streaming implementation by specifying two >> >>> (logical) >> >>>>>>>> time >> >>>>>>>>> points in Tempura, representing the initial run and later delta >> >>> runs >> >>>>>>>>> respectively. In general, note that Tempura supports various >> form >> >>> of >> >>>>>>>>> incremental computing, not only the small-delta append-only data >> >>>>>>> model in >> >>>>>>>>> streaming systems. That's why we believe Tempura subsumes the >> >>> current >> >>>>>>>>> streaming support, as well as any IVM implementations. >> >>>>>>>>> >> >>>>>>>>> About the cost model, we did not come up with a seperate cost >> >>> model, >> >>>>>>> but >> >>>>>>>>> rather extended the existing one. Similar to multi-objective >> >>>>>>>> optimization, >> >>>>>>>>> costs incurred at different time points are considered different >> >>>>>>>>> dimensions. Tempura lets users supply a function that converts >> this >> >>>>>>> cost >> >>>>>>>>> vector into a final cost. So under this function, any two >> >>> incremental >> >>>>>>>> plans >> >>>>>>>>> are still comparable and there is an overall optimum. I guess we >> >>> can >> >>>>>>> go >> >>>>>>>>> down the route of multi-objective parametric query optimization >> >>>>>>> instead >> >>>>>>>> if >> >>>>>>>>> there is a need. >> >>>>>>>>> >> >>>>>>>>> Next on materialized views and multi-query optimization, since >> our >> >>>>>>>>> multi-time-point plan naturally involves materializing >> intermediate >> >>>>>>>> results >> >>>>>>>>> for later time points, we need to solve the problem of choosing >> >>>>>>>>> materializations and include the cost of saving and reusing the >> >>>>>>>>> materializations when costing and comparing plans. We borrowed >> the >> >>>>>>>>> multi-query optimization techniques to solve this problem even >> >>> though >> >>>>>>> we >> >>>>>>>>> are looking at a single query. As a result, we think our work is >> >>>>>>>> orthogonal >> >>>>>>>>> to Calcite's facilities around utilizing existing views, lattice >> >>> etc. >> >>>>>>> We >> >>>>>>>> do >> >>>>>>>>> feel that the multi-query optimization component can be adopted >> to >> >>>>>>> wider >> >>>>>>>>> use, but probably need more suggestions from the community. >> >>>>>>>>> >> >>>>>>>>> Lastly, our current implementation is set up in java code, it >> >>> should >> >>>>>>> be >> >>>>>>>>> straightforward to hook it up with SQL shell. >> >>>>>>>>> >> >>>>>>>>> Thanks, >> >>>>>>>>> Botong >> >>>>>>>>> >> >>>>>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde < >> >>> [email protected]> >> >>>>>>>>> wrote: >> >>>>>>>>> >> >>>>>>>>>> Botong, >> >>>>>>>>>> >> >>>>>>>>>> This is very exciting; congratulations on this research, and >> thank >> >>>>>>> you >> >>>>>>>>>> for contributing it back to Calcite. >> >>>>>>>>>> >> >>>>>>>>>> The research touches several areas in Calcite: streaming, >> >>>>>>> materialized >> >>>>>>>>>> view maintenance, and multi-query optimization. As we have >> already >> >>>>>>> some >> >>>>>>>>>> solutions in those areas (Sigma and Delta relational operators, >> >>>>>>> lattice, >> >>>>>>>>>> and Spool operator), it will be interesting to see whether we >> can >> >>>>>>> make >> >>>>>>>> them >> >>>>>>>>>> compatible, or whether one concept can subsume others. >> >>>>>>>>>> >> >>>>>>>>>> Your work differs from streaming queries in that your relations >> >>> are >> >>>>>>> used >> >>>>>>>>>> by “external” user queries, whereas in pure streaming queries, >> the >> >>>>>>> only >> >>>>>>>>>> activity is the change propagation. Did you find that you >> needed >> >>> two >> >>>>>>>>>> separate cost models - one for “view maintenance” and another >> for >> >>>>>>> “user >> >>>>>>>>>> queries” - since the objectives of each activity are so >> different? >> >>>>>>>>>> >> >>>>>>>>>> I wonder whether this work will hasten the arrival of >> >>> multi-objective >> >>>>>>>>>> parametric query optimization [1] in Calcite. >> >>>>>>>>>> >> >>>>>>>>>> I will make time over the next few days to read and digest your >> >>>>>>> paper. >> >>>>>>>>>> Then I expect that we will have a back-and-forth process to >> create >> >>>>>>>>>> something that will be useful for the broader community. >> >>>>>>>>>> >> >>>>>>>>>> One thing will be particularly useful: making this >> functionality >> >>>>>>>>>> available from a SQL shell, so that people can experiment with >> >>> this >> >>>>>>>>>> functionality without writing Java code or setting up complex >> >>>>>>> databases >> >>>>>>>> and >> >>>>>>>>>> metadata. I have in mind something like the simple DDL >> operations >> >>>>>>> that >> >>>>>>>> are >> >>>>>>>>>> available in Calcite’s ’server’ module. I wonder whether we >> could >> >>>>>>> devise >> >>>>>>>>>> some kind of SQL syntax for a “multi-query”. >> >>>>>>>>>> >> >>>>>>>>>> Julian >> >>>>>>>>>> >> >>>>>>>>>> [1] >> >>>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>> >> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> >> >>>>>>> wrote: >> >>>>>>>>>>> >> >>>>>>>>>>> Thanks Aron for pointing this out. To see the figure, please >> >>> refer >> >>>>>>> to >> >>>>>>>>>> Fig >> >>>>>>>>>>> 3(a) in our paper: >> >>>>>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf >> >>>>>>>>>>> >> >>>>>>>>>>> Best, >> >>>>>>>>>>> Botong >> >>>>>>>>>>> >> >>>>>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao < >> [email protected]> >> >>>>>>>> wrote: >> >>>>>>>>>>> >> >>>>>>>>>>>> Seems interesting, the pic can not be seen in the mail, may >> you >> >>>>>>> open >> >>>>>>>> a >> >>>>>>>>>> JIRA >> >>>>>>>>>>>> for this, people who are interested in this can subscribe to >> the >> >>>>>>>> JIRA? >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> Regards! >> >>>>>>>>>>>> >> >>>>>>>>>>>> Aron Tao >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道: >> >>>>>>>>>>>> >> >>>>>>>>>>>>> Hi all, >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> This is a proposal to extend the Calcite optimizer into a >> >>> general >> >>>>>>>>>>>>> incremental query optimizer, based on our research paper >> >>>>>>> published >> >>>>>>>> in >> >>>>>>>>>>>> VLDB >> >>>>>>>>>>>>> 2021: >> >>>>>>>>>>>>> Tempura: a general cost-based optimizer framework for >> >>> incremental >> >>>>>>>> data >> >>>>>>>>>>>>> processing >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how >> Alibaba’s >> >>>>>>> data >> >>>>>>>>>>>>> warehouse is planning to use this incremental query >> optimizer >> >>> to >> >>>>>>>>>>>> alleviate >> >>>>>>>>>>>>> cluster-wise resource skewness: >> >>>>>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware >> >>> Incremental >> >>>>>>>>>>>> Computing >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> To our best knowledge, this is the first general cost-based >> >>>>>>>>>> incremental >> >>>>>>>>>>>>> optimizer that can find the best plan across multiple >> families >> >>> of >> >>>>>>>>>>>>> incremental computing methods, including IVM, Streaming, >> >>>>>>> DBToaster, >> >>>>>>>>>> etc. >> >>>>>>>>>>>>> Experiments (in the paper) shows that the generated best >> plan >> >>> is >> >>>>>>>>>>>>> consistently much better than the plans from each individual >> >>>>>>> method >> >>>>>>>>>>>> alone. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> In general, incremental query planning is central to >> database >> >>>>>>> view >> >>>>>>>>>>>>> maintenance and stream processing systems, and are being >> >>> adopted >> >>>>>>> in >> >>>>>>>>>>>> active >> >>>>>>>>>>>>> databases, resumable query execution, approximate query >> >>>>>>> processing, >> >>>>>>>>>> etc. >> >>>>>>>>>>>> We >> >>>>>>>>>>>>> are hoping that this feature can help widening the spectrum >> of >> >>>>>>>>>> Calcite, >> >>>>>>>>>>>>> solicit more use cases and adoption of Calcite. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Below is a brief description of the technical details. >> Please >> >>>>>>> refer >> >>>>>>>> to >> >>>>>>>>>>>> the >> >>>>>>>>>>>>> Tempura paper for more details. We are also working on a >> >>> journal >> >>>>>>>>>> version >> >>>>>>>>>>>> of >> >>>>>>>>>>>>> the paper with more implementation details. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Currently the query plan generated by Calcite is meant to be >> >>>>>>>> executed >> >>>>>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be >> >>>>>>> extended >> >>>>>>>>>> with >> >>>>>>>>>>>>> temporal information so that it is capable of generating >> >>>>>>> incremental >> >>>>>>>>>>>> plans >> >>>>>>>>>>>>> that include multiple sub-plans to execute at different time >> >>>>>>> points. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> The main idea is to view each table as one that changes over >> >>> time >> >>>>>>>>>> (Time >> >>>>>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced >> >>>>>>> TvrMetaSet >> >>>>>>>>>> into >> >>>>>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related >> >>>>>>> RelSets >> >>>>>>>>>> of a >> >>>>>>>>>>>>> changing table (e.g. snapshot of the table at certain time, >> >>>>>>> delta of >> >>>>>>>>>> the >> >>>>>>>>>>>>> table between two time points, etc.). >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> [image: image.png] >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> For example in the above figure, each vertical line is a >> >>>>>>> TvrMetaSet >> >>>>>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). >> >>> Horizontal >> >>>>>>>> lines >> >>>>>>>>>>>>> represent time. Each black dot in the grid is a RelSet. >> Users >> >>> can >> >>>>>>>>>> write >> >>>>>>>>>>>> TVR >> >>>>>>>>>>>>> Rewrite Rules to describe valid transformations between >> these >> >>>>>>> dots. >> >>>>>>>>>> For >> >>>>>>>>>>>>> example, the blues lines are inter-TVR rules that describe >> how >> >>> to >> >>>>>>>>>> compute >> >>>>>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red >> >>> lines >> >>>>>>>> are >> >>>>>>>>>>>>> intra-TVR rules that describe transformations within a TVR. >> All >> >>>>>>> TVR >> >>>>>>>>>>>> rewrite >> >>>>>>>>>>>>> rules are logical rules. All existing Calcite rules still >> work >> >>> in >> >>>>>>>> the >> >>>>>>>>>> new >> >>>>>>>>>>>>> volcano system without modification. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> All changes in this feature will consist of four parts: >> >>>>>>>>>>>>> 1. Memo extension with TvrMetaSet >> >>>>>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and >> >>>>>>> RelNodes, >> >>>>>>>>>> as >> >>>>>>>>>>>>> well as links in between the nodes. >> >>>>>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule >> >>>>>>> engine >> >>>>>>>>>> API. >> >>>>>>>>>>>>> 4. Multi-query optimization, used to find the best >> incremental >> >>>>>>> plan >> >>>>>>>>>>>>> involving multiple time points. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Note that this feature is an extension in nature and thus >> when >> >>>>>>>>>> disabled, >> >>>>>>>>>>>>> does not change any existing Calcite behavior. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Other than scenarios in the paper, we also applied this >> >>>>>>>>>> Calcite-extended >> >>>>>>>>>>>>> incremental query optimizer to a type of periodic query >> called >> >>>>>>> the >> >>>>>>>>>>>> ‘‘range >> >>>>>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost >> savings >> >>> of >> >>>>>>> 80% >> >>>>>>>>>> on >> >>>>>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end >> >>> execution >> >>>>>>>>>> time. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> All comments and suggestions are welcome. Thanks and happy >> >>>>>>> holidays! >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Best, >> >>>>>>>>>>>>> Botong >> >>>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> -- >> >>>>>>> ~~~~~~~~~~~~~~~ >> >>>>>>> no mistakes >> >>>>>>> ~~~~~~~~~~~~~~~~~~ >> >>>>>>> >> >>>>>> >> >>> >> >> >> >>
