Hi Julian and all, We've posted the Tempura code base below. Feel free to take a quick peek at the last five commits. https://github.com/alibaba/cost-based-incremental-optimizer/commits/main
I've also opened a Jira (CALCITE-4568 <https://issues.apache.org/jira/browse/CALCITE-4568>), which will serve as the umbrella Jira for the feature. In the meantime, we encourage everyone to enter the time preferences for our first meeting here: https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing Thanks, Botong On Mon, Apr 5, 2021 at 3:59 PM Julian Hyde <[email protected]> wrote: > I have added my time preferences to the doc. > > Before we meet, could you publish a PR for us to review? > > Initial discussions will need to be about architecture and high-level > design. So I would ask Calcite reviewers not to review the PR line-by-line > (or to leave comments in GitHub) but try to understand the design > holistically, and prepare questions/comments before the meeting. > > Botong, Can you please create a Calcite JIRA case for this task? JIRA how > we track long-running tasks such as this. > > Julian > > > > On Apr 3, 2021, at 5:15 PM, Botong Huang <[email protected]> wrote: > > > > Hi all, > > > > Apology for the delay. It took us some time to clean up our code base and > > publicly release it (which will be out soon) for a quick peek. > > > > We are ready to present our work. Let's schedule a time for a Zoom > > meeting and discuss how to integrate Tempura into Calcite. > > > > Since some of our team members are in China, we prefer the time slot of > > 7:00pm-11:30pm PST any day. I've added our time preference in the shared > > doc below. > > > https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing > > > > We encourage everyone to add their time preferences (during 04/15-04/30) > in > > this doc. In a week or so, we will try to settle a time that works for > > most. > > > > Thanks, > > Botong > > > > On Sat, Jan 30, 2021 at 9:19 PM Botong Huang <[email protected]> wrote: > > > >> Hi Julian and Rui, > >> > >> Sounds good to us. Please give us some time to prepare some slides for > the > >> meeting. > >> > >> I've created a doc below for discussion. Please feel free to add more in > >> here: > >> > >> > https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing > >> > >> Thanks, > >> Botong > >> > >> On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]> > >> wrote: > >> > >>> PS The “editable doc” that Rui refers to is also a good idea. I think > we > >>> should create it to continue discussion after the first meeting. > >>> > >>> Julian > >>> > >>>> On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]> > >>> wrote: > >>>> > >>>> I think good next steps would be a PR and a meeting. The PR will > allow > >>> us to read the code, but I think we should do the first round of > questions > >>> at the meeting. The meeting could perhaps start with a presentation > of the > >>> paper (do you have some slides you are planning to present at VLDB, > >>> Botong?) and then move on to questions about the concepts, which > >>> alternatives were considered, and how the concepts map onto other > current > >>> and future concepts in calcite. > >>>> > >>>> I don’t think we should start “reviewing” the PR line-by-line at this > >>> point. We need to understand the high-level concepts and design > choices. If > >>> we start reviewing the PR we will get lost in the details. > >>>> > >>>> I know that integrating a major change is hard; I doubt that we will > be > >>> able to integrate everything, but we can build understanding about > where > >>> calcite needs to go, and I hope integrate a good amount of code to > help us > >>> get there. > >>>> > >>>> As I said before, after the integration I would like people to be able > >>> to experiment with it and use it in their production systems. That > way, it > >>> will not be an experiment that withers, but a feature set integrates > with > >>> other calcite features and gets stronger over time. > >>>> > >>>> Julian > >>>> > >>>>> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote: > >>>>> > >>>>> For me to participate in the discussion for the above questions, I > >>> will > >>>>> need to read a lot more to know relevant context and likely ask lots > of > >>>>> questions :-). A editable doc is probably good for questions and > back > >>> and > >>>>> forward discussion. > >>>>> > >>>>> > >>>>> -Rui > >>>>> > >>>>>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> > >>> wrote: > >>>>>> > >>>>>> I am also happy to help push this work into Calcite (review code and > >>> doc, > >>>>>> etc.). > >>>>>> > >>>>>> While you can share your code so people can have more idea how it is > >>>>>> implemented, I think it would be also nice to have a doc to discuss > >>> open > >>>>>> questions above. Some points that I copy those to here: > >>>>>> > >>>>>> 1. Can this solution be compatible with existing solutions in > Calcite > >>>>>> Streaming, materialized view maintenance, and multi-query > optimization > >>>>>> (Sigma and Delta relational operators, lattice, and Spool operator), > >>>>>> 2. Did you find that you needed two separate cost models - one for > >>> “view > >>>>>> maintenance” and another for “user queries” - since the objectives > of > >>> each > >>>>>> activity are so different? > >>>>>> 3. whether this work will hasten the arrival of multi-objective > >>> parametric > >>>>>> query optimization [1] in Calcite. > >>>>>> 4. probably SQL shell support. > >>>>>> > >>>>>> > >>>>>> [1]: > >>>>>> > >>> > https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext > >>>>>> > >>>>>> > >>>>>> -Rui > >>>>>> > >>>>>> > >>>>>> > >>>>>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote: > >>>>>>> > >>>>>>> it would be very nice to see a POC of your work. > >>>>>>> > >>>>>>> > >>>>>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> > >>> wrote: > >>>>>>> > >>>>>>>> Hi Julian, > >>>>>>>> > >>>>>>>> Just wondering if there are any updates? We are wondering if it > >>> would > >>>>>>> help > >>>>>>>> to post our code for a quick preview. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Botong > >>>>>>>> > >>>>>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> > >>> wrote: > >>>>>>>> > >>>>>>>>> Hi Julian, > >>>>>>>>> > >>>>>>>>> Thanks for your interest! Sure let's figure out a plan that best > >>>>>>> benefits > >>>>>>>>> the community. Here are some clarifications that hopefully answer > >>> your > >>>>>>>>> questions. > >>>>>>>>> > >>>>>>>>> In our work (Tempura), users specify the set of time points to > >>>>>>> consider > >>>>>>>>> running and a cost function that expresses users' preference over > >>>>>>> time, > >>>>>>>>> Tempura will generate the best incremental plan that minimizes > the > >>>>>>>> overall > >>>>>>>>> cost function. > >>>>>>>>> > >>>>>>>>> In this incremental plan, the sub-plans at different time points > >>> can > >>>>>>> be > >>>>>>>>> different from each other, as opposed to identical plans in all > >>> delta > >>>>>>>> runs > >>>>>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura > paper, > >>> we > >>>>>>> can > >>>>>>>>> mimic the current streaming implementation by specifying two > >>> (logical) > >>>>>>>> time > >>>>>>>>> points in Tempura, representing the initial run and later delta > >>> runs > >>>>>>>>> respectively. In general, note that Tempura supports various form > >>> of > >>>>>>>>> incremental computing, not only the small-delta append-only data > >>>>>>> model in > >>>>>>>>> streaming systems. That's why we believe Tempura subsumes the > >>> current > >>>>>>>>> streaming support, as well as any IVM implementations. > >>>>>>>>> > >>>>>>>>> About the cost model, we did not come up with a seperate cost > >>> model, > >>>>>>> but > >>>>>>>>> rather extended the existing one. Similar to multi-objective > >>>>>>>> optimization, > >>>>>>>>> costs incurred at different time points are considered different > >>>>>>>>> dimensions. Tempura lets users supply a function that converts > this > >>>>>>> cost > >>>>>>>>> vector into a final cost. So under this function, any two > >>> incremental > >>>>>>>> plans > >>>>>>>>> are still comparable and there is an overall optimum. I guess we > >>> can > >>>>>>> go > >>>>>>>>> down the route of multi-objective parametric query optimization > >>>>>>> instead > >>>>>>>> if > >>>>>>>>> there is a need. > >>>>>>>>> > >>>>>>>>> Next on materialized views and multi-query optimization, since > our > >>>>>>>>> multi-time-point plan naturally involves materializing > intermediate > >>>>>>>> results > >>>>>>>>> for later time points, we need to solve the problem of choosing > >>>>>>>>> materializations and include the cost of saving and reusing the > >>>>>>>>> materializations when costing and comparing plans. We borrowed > the > >>>>>>>>> multi-query optimization techniques to solve this problem even > >>> though > >>>>>>> we > >>>>>>>>> are looking at a single query. As a result, we think our work is > >>>>>>>> orthogonal > >>>>>>>>> to Calcite's facilities around utilizing existing views, lattice > >>> etc. > >>>>>>> We > >>>>>>>> do > >>>>>>>>> feel that the multi-query optimization component can be adopted > to > >>>>>>> wider > >>>>>>>>> use, but probably need more suggestions from the community. > >>>>>>>>> > >>>>>>>>> Lastly, our current implementation is set up in java code, it > >>> should > >>>>>>> be > >>>>>>>>> straightforward to hook it up with SQL shell. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Botong > >>>>>>>>> > >>>>>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde < > >>> [email protected]> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Botong, > >>>>>>>>>> > >>>>>>>>>> This is very exciting; congratulations on this research, and > thank > >>>>>>> you > >>>>>>>>>> for contributing it back to Calcite. > >>>>>>>>>> > >>>>>>>>>> The research touches several areas in Calcite: streaming, > >>>>>>> materialized > >>>>>>>>>> view maintenance, and multi-query optimization. As we have > already > >>>>>>> some > >>>>>>>>>> solutions in those areas (Sigma and Delta relational operators, > >>>>>>> lattice, > >>>>>>>>>> and Spool operator), it will be interesting to see whether we > can > >>>>>>> make > >>>>>>>> them > >>>>>>>>>> compatible, or whether one concept can subsume others. > >>>>>>>>>> > >>>>>>>>>> Your work differs from streaming queries in that your relations > >>> are > >>>>>>> used > >>>>>>>>>> by “external” user queries, whereas in pure streaming queries, > the > >>>>>>> only > >>>>>>>>>> activity is the change propagation. Did you find that you needed > >>> two > >>>>>>>>>> separate cost models - one for “view maintenance” and another > for > >>>>>>> “user > >>>>>>>>>> queries” - since the objectives of each activity are so > different? > >>>>>>>>>> > >>>>>>>>>> I wonder whether this work will hasten the arrival of > >>> multi-objective > >>>>>>>>>> parametric query optimization [1] in Calcite. > >>>>>>>>>> > >>>>>>>>>> I will make time over the next few days to read and digest your > >>>>>>> paper. > >>>>>>>>>> Then I expect that we will have a back-and-forth process to > create > >>>>>>>>>> something that will be useful for the broader community. > >>>>>>>>>> > >>>>>>>>>> One thing will be particularly useful: making this functionality > >>>>>>>>>> available from a SQL shell, so that people can experiment with > >>> this > >>>>>>>>>> functionality without writing Java code or setting up complex > >>>>>>> databases > >>>>>>>> and > >>>>>>>>>> metadata. I have in mind something like the simple DDL > operations > >>>>>>> that > >>>>>>>> are > >>>>>>>>>> available in Calcite’s ’server’ module. I wonder whether we > could > >>>>>>> devise > >>>>>>>>>> some kind of SQL syntax for a “multi-query”. > >>>>>>>>>> > >>>>>>>>>> Julian > >>>>>>>>>> > >>>>>>>>>> [1] > >>>>>>>>>> > >>>>>>>> > >>>>>>> > >>> > https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> > >>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Thanks Aron for pointing this out. To see the figure, please > >>> refer > >>>>>>> to > >>>>>>>>>> Fig > >>>>>>>>>>> 3(a) in our paper: > >>>>>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf > >>>>>>>>>>> > >>>>>>>>>>> Best, > >>>>>>>>>>> Botong > >>>>>>>>>>> > >>>>>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao < > [email protected]> > >>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Seems interesting, the pic can not be seen in the mail, may > you > >>>>>>> open > >>>>>>>> a > >>>>>>>>>> JIRA > >>>>>>>>>>>> for this, people who are interested in this can subscribe to > the > >>>>>>>> JIRA? > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Regards! > >>>>>>>>>>>> > >>>>>>>>>>>> Aron Tao > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>> > >>>>>>>>>>>>> This is a proposal to extend the Calcite optimizer into a > >>> general > >>>>>>>>>>>>> incremental query optimizer, based on our research paper > >>>>>>> published > >>>>>>>> in > >>>>>>>>>>>> VLDB > >>>>>>>>>>>>> 2021: > >>>>>>>>>>>>> Tempura: a general cost-based optimizer framework for > >>> incremental > >>>>>>>> data > >>>>>>>>>>>>> processing > >>>>>>>>>>>>> > >>>>>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s > >>>>>>> data > >>>>>>>>>>>>> warehouse is planning to use this incremental query optimizer > >>> to > >>>>>>>>>>>> alleviate > >>>>>>>>>>>>> cluster-wise resource skewness: > >>>>>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware > >>> Incremental > >>>>>>>>>>>> Computing > >>>>>>>>>>>>> > >>>>>>>>>>>>> To our best knowledge, this is the first general cost-based > >>>>>>>>>> incremental > >>>>>>>>>>>>> optimizer that can find the best plan across multiple > families > >>> of > >>>>>>>>>>>>> incremental computing methods, including IVM, Streaming, > >>>>>>> DBToaster, > >>>>>>>>>> etc. > >>>>>>>>>>>>> Experiments (in the paper) shows that the generated best plan > >>> is > >>>>>>>>>>>>> consistently much better than the plans from each individual > >>>>>>> method > >>>>>>>>>>>> alone. > >>>>>>>>>>>>> > >>>>>>>>>>>>> In general, incremental query planning is central to database > >>>>>>> view > >>>>>>>>>>>>> maintenance and stream processing systems, and are being > >>> adopted > >>>>>>> in > >>>>>>>>>>>> active > >>>>>>>>>>>>> databases, resumable query execution, approximate query > >>>>>>> processing, > >>>>>>>>>> etc. > >>>>>>>>>>>> We > >>>>>>>>>>>>> are hoping that this feature can help widening the spectrum > of > >>>>>>>>>> Calcite, > >>>>>>>>>>>>> solicit more use cases and adoption of Calcite. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Below is a brief description of the technical details. Please > >>>>>>> refer > >>>>>>>> to > >>>>>>>>>>>> the > >>>>>>>>>>>>> Tempura paper for more details. We are also working on a > >>> journal > >>>>>>>>>> version > >>>>>>>>>>>> of > >>>>>>>>>>>>> the paper with more implementation details. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Currently the query plan generated by Calcite is meant to be > >>>>>>>> executed > >>>>>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be > >>>>>>> extended > >>>>>>>>>> with > >>>>>>>>>>>>> temporal information so that it is capable of generating > >>>>>>> incremental > >>>>>>>>>>>> plans > >>>>>>>>>>>>> that include multiple sub-plans to execute at different time > >>>>>>> points. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The main idea is to view each table as one that changes over > >>> time > >>>>>>>>>> (Time > >>>>>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced > >>>>>>> TvrMetaSet > >>>>>>>>>> into > >>>>>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related > >>>>>>> RelSets > >>>>>>>>>> of a > >>>>>>>>>>>>> changing table (e.g. snapshot of the table at certain time, > >>>>>>> delta of > >>>>>>>>>> the > >>>>>>>>>>>>> table between two time points, etc.). > >>>>>>>>>>>>> > >>>>>>>>>>>>> [image: image.png] > >>>>>>>>>>>>> > >>>>>>>>>>>>> For example in the above figure, each vertical line is a > >>>>>>> TvrMetaSet > >>>>>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). > >>> Horizontal > >>>>>>>> lines > >>>>>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users > >>> can > >>>>>>>>>> write > >>>>>>>>>>>> TVR > >>>>>>>>>>>>> Rewrite Rules to describe valid transformations between these > >>>>>>> dots. > >>>>>>>>>> For > >>>>>>>>>>>>> example, the blues lines are inter-TVR rules that describe > how > >>> to > >>>>>>>>>> compute > >>>>>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red > >>> lines > >>>>>>>> are > >>>>>>>>>>>>> intra-TVR rules that describe transformations within a TVR. > All > >>>>>>> TVR > >>>>>>>>>>>> rewrite > >>>>>>>>>>>>> rules are logical rules. All existing Calcite rules still > work > >>> in > >>>>>>>> the > >>>>>>>>>> new > >>>>>>>>>>>>> volcano system without modification. > >>>>>>>>>>>>> > >>>>>>>>>>>>> All changes in this feature will consist of four parts: > >>>>>>>>>>>>> 1. Memo extension with TvrMetaSet > >>>>>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and > >>>>>>> RelNodes, > >>>>>>>>>> as > >>>>>>>>>>>>> well as links in between the nodes. > >>>>>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule > >>>>>>> engine > >>>>>>>>>> API. > >>>>>>>>>>>>> 4. Multi-query optimization, used to find the best > incremental > >>>>>>> plan > >>>>>>>>>>>>> involving multiple time points. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Note that this feature is an extension in nature and thus > when > >>>>>>>>>> disabled, > >>>>>>>>>>>>> does not change any existing Calcite behavior. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Other than scenarios in the paper, we also applied this > >>>>>>>>>> Calcite-extended > >>>>>>>>>>>>> incremental query optimizer to a type of periodic query > called > >>>>>>> the > >>>>>>>>>>>> ‘‘range > >>>>>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings > >>> of > >>>>>>> 80% > >>>>>>>>>> on > >>>>>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end > >>> execution > >>>>>>>>>> time. > >>>>>>>>>>>>> > >>>>>>>>>>>>> All comments and suggestions are welcome. Thanks and happy > >>>>>>> holidays! > >>>>>>>>>>>>> > >>>>>>>>>>>>> Best, > >>>>>>>>>>>>> Botong > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> ~~~~~~~~~~~~~~~ > >>>>>>> no mistakes > >>>>>>> ~~~~~~~~~~~~~~~~~~ > >>>>>>> > >>>>>> > >>> > >> > >
