Hi Julian and Rui, Sounds good to us. Please give us some time to prepare some slides for the meeting.
I've created a doc below for discussion. Please feel free to add more in here: https://docs.google.com/document/d/1wyNjB94uSGwHtVvGYDwaLlCghUJE-7aDLnCdKKXJN1o/edit?usp=sharing Thanks, Botong On Thu, Jan 28, 2021 at 11:18 AM Julian Hyde <[email protected]> wrote: > PS The “editable doc” that Rui refers to is also a good idea. I think we > should create it to continue discussion after the first meeting. > > Julian > > > On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]> > wrote: > > > > I think good next steps would be a PR and a meeting. The PR will allow > us to read the code, but I think we should do the first round of questions > at the meeting. The meeting could perhaps start with a presentation of the > paper (do you have some slides you are planning to present at VLDB, > Botong?) and then move on to questions about the concepts, which > alternatives were considered, and how the concepts map onto other current > and future concepts in calcite. > > > > I don’t think we should start “reviewing” the PR line-by-line at this > point. We need to understand the high-level concepts and design choices. If > we start reviewing the PR we will get lost in the details. > > > > I know that integrating a major change is hard; I doubt that we will be > able to integrate everything, but we can build understanding about where > calcite needs to go, and I hope integrate a good amount of code to help us > get there. > > > > As I said before, after the integration I would like people to be able > to experiment with it and use it in their production systems. That way, it > will not be an experiment that withers, but a feature set integrates with > other calcite features and gets stronger over time. > > > > Julian > > > >> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote: > >> > >> For me to participate in the discussion for the above questions, I will > >> need to read a lot more to know relevant context and likely ask lots of > >> questions :-). A editable doc is probably good for questions and back > and > >> forward discussion. > >> > >> > >> -Rui > >> > >>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> > wrote: > >>> > >>> I am also happy to help push this work into Calcite (review code and > doc, > >>> etc.). > >>> > >>> While you can share your code so people can have more idea how it is > >>> implemented, I think it would be also nice to have a doc to discuss > open > >>> questions above. Some points that I copy those to here: > >>> > >>> 1. Can this solution be compatible with existing solutions in Calcite > >>> Streaming, materialized view maintenance, and multi-query optimization > >>> (Sigma and Delta relational operators, lattice, and Spool operator), > >>> 2. Did you find that you needed two separate cost models - one for > “view > >>> maintenance” and another for “user queries” - since the objectives of > each > >>> activity are so different? > >>> 3. whether this work will hasten the arrival of multi-objective > parametric > >>> query optimization [1] in Calcite. > >>> 4. probably SQL shell support. > >>> > >>> > >>> [1]: > >>> > https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext > >>> > >>> > >>> -Rui > >>> > >>> > >>> > >>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote: > >>>> > >>>> it would be very nice to see a POC of your work. > >>>> > >>>> > >>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> > wrote: > >>>> > >>>>> Hi Julian, > >>>>> > >>>>> Just wondering if there are any updates? We are wondering if it would > >>>> help > >>>>> to post our code for a quick preview. > >>>>> > >>>>> Thanks, > >>>>> Botong > >>>>> > >>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> > wrote: > >>>>> > >>>>>> Hi Julian, > >>>>>> > >>>>>> Thanks for your interest! Sure let's figure out a plan that best > >>>> benefits > >>>>>> the community. Here are some clarifications that hopefully answer > your > >>>>>> questions. > >>>>>> > >>>>>> In our work (Tempura), users specify the set of time points to > >>>> consider > >>>>>> running and a cost function that expresses users' preference over > >>>> time, > >>>>>> Tempura will generate the best incremental plan that minimizes the > >>>>> overall > >>>>>> cost function. > >>>>>> > >>>>>> In this incremental plan, the sub-plans at different time points can > >>>> be > >>>>>> different from each other, as opposed to identical plans in all > delta > >>>>> runs > >>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, > we > >>>> can > >>>>>> mimic the current streaming implementation by specifying two > (logical) > >>>>> time > >>>>>> points in Tempura, representing the initial run and later delta runs > >>>>>> respectively. In general, note that Tempura supports various form of > >>>>>> incremental computing, not only the small-delta append-only data > >>>> model in > >>>>>> streaming systems. That's why we believe Tempura subsumes the > current > >>>>>> streaming support, as well as any IVM implementations. > >>>>>> > >>>>>> About the cost model, we did not come up with a seperate cost model, > >>>> but > >>>>>> rather extended the existing one. Similar to multi-objective > >>>>> optimization, > >>>>>> costs incurred at different time points are considered different > >>>>>> dimensions. Tempura lets users supply a function that converts this > >>>> cost > >>>>>> vector into a final cost. So under this function, any two > incremental > >>>>> plans > >>>>>> are still comparable and there is an overall optimum. I guess we can > >>>> go > >>>>>> down the route of multi-objective parametric query optimization > >>>> instead > >>>>> if > >>>>>> there is a need. > >>>>>> > >>>>>> Next on materialized views and multi-query optimization, since our > >>>>>> multi-time-point plan naturally involves materializing intermediate > >>>>> results > >>>>>> for later time points, we need to solve the problem of choosing > >>>>>> materializations and include the cost of saving and reusing the > >>>>>> materializations when costing and comparing plans. We borrowed the > >>>>>> multi-query optimization techniques to solve this problem even > though > >>>> we > >>>>>> are looking at a single query. As a result, we think our work is > >>>>> orthogonal > >>>>>> to Calcite's facilities around utilizing existing views, lattice > etc. > >>>> We > >>>>> do > >>>>>> feel that the multi-query optimization component can be adopted to > >>>> wider > >>>>>> use, but probably need more suggestions from the community. > >>>>>> > >>>>>> Lastly, our current implementation is set up in java code, it should > >>>> be > >>>>>> straightforward to hook it up with SQL shell. > >>>>>> > >>>>>> Thanks, > >>>>>> Botong > >>>>>> > >>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected] > > > >>>>>> wrote: > >>>>>> > >>>>>>> Botong, > >>>>>>> > >>>>>>> This is very exciting; congratulations on this research, and thank > >>>> you > >>>>>>> for contributing it back to Calcite. > >>>>>>> > >>>>>>> The research touches several areas in Calcite: streaming, > >>>> materialized > >>>>>>> view maintenance, and multi-query optimization. As we have already > >>>> some > >>>>>>> solutions in those areas (Sigma and Delta relational operators, > >>>> lattice, > >>>>>>> and Spool operator), it will be interesting to see whether we can > >>>> make > >>>>> them > >>>>>>> compatible, or whether one concept can subsume others. > >>>>>>> > >>>>>>> Your work differs from streaming queries in that your relations are > >>>> used > >>>>>>> by “external” user queries, whereas in pure streaming queries, the > >>>> only > >>>>>>> activity is the change propagation. Did you find that you needed > two > >>>>>>> separate cost models - one for “view maintenance” and another for > >>>> “user > >>>>>>> queries” - since the objectives of each activity are so different? > >>>>>>> > >>>>>>> I wonder whether this work will hasten the arrival of > multi-objective > >>>>>>> parametric query optimization [1] in Calcite. > >>>>>>> > >>>>>>> I will make time over the next few days to read and digest your > >>>> paper. > >>>>>>> Then I expect that we will have a back-and-forth process to create > >>>>>>> something that will be useful for the broader community. > >>>>>>> > >>>>>>> One thing will be particularly useful: making this functionality > >>>>>>> available from a SQL shell, so that people can experiment with this > >>>>>>> functionality without writing Java code or setting up complex > >>>> databases > >>>>> and > >>>>>>> metadata. I have in mind something like the simple DDL operations > >>>> that > >>>>> are > >>>>>>> available in Calcite’s ’server’ module. I wonder whether we could > >>>> devise > >>>>>>> some kind of SQL syntax for a “multi-query”. > >>>>>>> > >>>>>>> Julian > >>>>>>> > >>>>>>> [1] > >>>>>>> > >>>>> > >>>> > https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> > >>>> wrote: > >>>>>>>> > >>>>>>>> Thanks Aron for pointing this out. To see the figure, please refer > >>>> to > >>>>>>> Fig > >>>>>>>> 3(a) in our paper: > >>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf > >>>>>>>> > >>>>>>>> Best, > >>>>>>>> Botong > >>>>>>>> > >>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]> > >>>>> wrote: > >>>>>>>> > >>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you > >>>> open > >>>>> a > >>>>>>> JIRA > >>>>>>>>> for this, people who are interested in this can subscribe to the > >>>>> JIRA? > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Regards! > >>>>>>>>> > >>>>>>>>> Aron Tao > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道: > >>>>>>>>> > >>>>>>>>>> Hi all, > >>>>>>>>>> > >>>>>>>>>> This is a proposal to extend the Calcite optimizer into a > general > >>>>>>>>>> incremental query optimizer, based on our research paper > >>>> published > >>>>> in > >>>>>>>>> VLDB > >>>>>>>>>> 2021: > >>>>>>>>>> Tempura: a general cost-based optimizer framework for > incremental > >>>>> data > >>>>>>>>>> processing > >>>>>>>>>> > >>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s > >>>> data > >>>>>>>>>> warehouse is planning to use this incremental query optimizer to > >>>>>>>>> alleviate > >>>>>>>>>> cluster-wise resource skewness: > >>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental > >>>>>>>>> Computing > >>>>>>>>>> > >>>>>>>>>> To our best knowledge, this is the first general cost-based > >>>>>>> incremental > >>>>>>>>>> optimizer that can find the best plan across multiple families > of > >>>>>>>>>> incremental computing methods, including IVM, Streaming, > >>>> DBToaster, > >>>>>>> etc. > >>>>>>>>>> Experiments (in the paper) shows that the generated best plan is > >>>>>>>>>> consistently much better than the plans from each individual > >>>> method > >>>>>>>>> alone. > >>>>>>>>>> > >>>>>>>>>> In general, incremental query planning is central to database > >>>> view > >>>>>>>>>> maintenance and stream processing systems, and are being adopted > >>>> in > >>>>>>>>> active > >>>>>>>>>> databases, resumable query execution, approximate query > >>>> processing, > >>>>>>> etc. > >>>>>>>>> We > >>>>>>>>>> are hoping that this feature can help widening the spectrum of > >>>>>>> Calcite, > >>>>>>>>>> solicit more use cases and adoption of Calcite. > >>>>>>>>>> > >>>>>>>>>> Below is a brief description of the technical details. Please > >>>> refer > >>>>> to > >>>>>>>>> the > >>>>>>>>>> Tempura paper for more details. We are also working on a journal > >>>>>>> version > >>>>>>>>> of > >>>>>>>>>> the paper with more implementation details. > >>>>>>>>>> > >>>>>>>>>> Currently the query plan generated by Calcite is meant to be > >>>>> executed > >>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be > >>>> extended > >>>>>>> with > >>>>>>>>>> temporal information so that it is capable of generating > >>>> incremental > >>>>>>>>> plans > >>>>>>>>>> that include multiple sub-plans to execute at different time > >>>> points. > >>>>>>>>>> > >>>>>>>>>> The main idea is to view each table as one that changes over > time > >>>>>>> (Time > >>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced > >>>> TvrMetaSet > >>>>>>> into > >>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related > >>>> RelSets > >>>>>>> of a > >>>>>>>>>> changing table (e.g. snapshot of the table at certain time, > >>>> delta of > >>>>>>> the > >>>>>>>>>> table between two time points, etc.). > >>>>>>>>>> > >>>>>>>>>> [image: image.png] > >>>>>>>>>> > >>>>>>>>>> For example in the above figure, each vertical line is a > >>>> TvrMetaSet > >>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). Horizontal > >>>>> lines > >>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users > can > >>>>>>> write > >>>>>>>>> TVR > >>>>>>>>>> Rewrite Rules to describe valid transformations between these > >>>> dots. > >>>>>>> For > >>>>>>>>>> example, the blues lines are inter-TVR rules that describe how > to > >>>>>>> compute > >>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red > lines > >>>>> are > >>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All > >>>> TVR > >>>>>>>>> rewrite > >>>>>>>>>> rules are logical rules. All existing Calcite rules still work > in > >>>>> the > >>>>>>> new > >>>>>>>>>> volcano system without modification. > >>>>>>>>>> > >>>>>>>>>> All changes in this feature will consist of four parts: > >>>>>>>>>> 1. Memo extension with TvrMetaSet > >>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and > >>>> RelNodes, > >>>>>>> as > >>>>>>>>>> well as links in between the nodes. > >>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule > >>>> engine > >>>>>>> API. > >>>>>>>>>> 4. Multi-query optimization, used to find the best incremental > >>>> plan > >>>>>>>>>> involving multiple time points. > >>>>>>>>>> > >>>>>>>>>> Note that this feature is an extension in nature and thus when > >>>>>>> disabled, > >>>>>>>>>> does not change any existing Calcite behavior. > >>>>>>>>>> > >>>>>>>>>> Other than scenarios in the paper, we also applied this > >>>>>>> Calcite-extended > >>>>>>>>>> incremental query optimizer to a type of periodic query called > >>>> the > >>>>>>>>> ‘‘range > >>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings of > >>>> 80% > >>>>>>> on > >>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end > execution > >>>>>>> time. > >>>>>>>>>> > >>>>>>>>>> All comments and suggestions are welcome. Thanks and happy > >>>> holidays! > >>>>>>>>>> > >>>>>>>>>> Best, > >>>>>>>>>> Botong > >>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>>> > >>>> > >>>> -- > >>>> ~~~~~~~~~~~~~~~ > >>>> no mistakes > >>>> ~~~~~~~~~~~~~~~~~~ > >>>> > >>> >
