PS The “editable doc” that Rui refers to is also a good idea. I think we should create it to continue discussion after the first meeting.
Julian > On Jan 28, 2021, at 11:16 AM, Julian Hyde <[email protected]> wrote: > > I think good next steps would be a PR and a meeting. The PR will allow us to > read the code, but I think we should do the first round of questions at the > meeting. The meeting could perhaps start with a presentation of the paper > (do you have some slides you are planning to present at VLDB, Botong?) and > then move on to questions about the concepts, which alternatives were > considered, and how the concepts map onto other current and future concepts > in calcite. > > I don’t think we should start “reviewing” the PR line-by-line at this point. > We need to understand the high-level concepts and design choices. If we start > reviewing the PR we will get lost in the details. > > I know that integrating a major change is hard; I doubt that we will be able > to integrate everything, but we can build understanding about where calcite > needs to go, and I hope integrate a good amount of code to help us get there. > > As I said before, after the integration I would like people to be able to > experiment with it and use it in their production systems. That way, it will > not be an experiment that withers, but a feature set integrates with other > calcite features and gets stronger over time. > > Julian > >> On Jan 28, 2021, at 10:54 AM, Rui Wang <[email protected]> wrote: >> >> For me to participate in the discussion for the above questions, I will >> need to read a lot more to know relevant context and likely ask lots of >> questions :-). A editable doc is probably good for questions and back and >> forward discussion. >> >> >> -Rui >> >>>> On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> wrote: >>> >>> I am also happy to help push this work into Calcite (review code and doc, >>> etc.). >>> >>> While you can share your code so people can have more idea how it is >>> implemented, I think it would be also nice to have a doc to discuss open >>> questions above. Some points that I copy those to here: >>> >>> 1. Can this solution be compatible with existing solutions in Calcite >>> Streaming, materialized view maintenance, and multi-query optimization >>> (Sigma and Delta relational operators, lattice, and Spool operator), >>> 2. Did you find that you needed two separate cost models - one for “view >>> maintenance” and another for “user queries” - since the objectives of each >>> activity are so different? >>> 3. whether this work will hasten the arrival of multi-objective parametric >>> query optimization [1] in Calcite. >>> 4. probably SQL shell support. >>> >>> >>> [1]: >>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >>> >>> >>> -Rui >>> >>> >>> >>>> On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote: >>>> >>>> it would be very nice to see a POC of your work. >>>> >>>> >>>>> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> wrote: >>>> >>>>> Hi Julian, >>>>> >>>>> Just wondering if there are any updates? We are wondering if it would >>>> help >>>>> to post our code for a quick preview. >>>>> >>>>> Thanks, >>>>> Botong >>>>> >>>>> On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> wrote: >>>>> >>>>>> Hi Julian, >>>>>> >>>>>> Thanks for your interest! Sure let's figure out a plan that best >>>> benefits >>>>>> the community. Here are some clarifications that hopefully answer your >>>>>> questions. >>>>>> >>>>>> In our work (Tempura), users specify the set of time points to >>>> consider >>>>>> running and a cost function that expresses users' preference over >>>> time, >>>>>> Tempura will generate the best incremental plan that minimizes the >>>>> overall >>>>>> cost function. >>>>>> >>>>>> In this incremental plan, the sub-plans at different time points can >>>> be >>>>>> different from each other, as opposed to identical plans in all delta >>>>> runs >>>>>> as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we >>>> can >>>>>> mimic the current streaming implementation by specifying two (logical) >>>>> time >>>>>> points in Tempura, representing the initial run and later delta runs >>>>>> respectively. In general, note that Tempura supports various form of >>>>>> incremental computing, not only the small-delta append-only data >>>> model in >>>>>> streaming systems. That's why we believe Tempura subsumes the current >>>>>> streaming support, as well as any IVM implementations. >>>>>> >>>>>> About the cost model, we did not come up with a seperate cost model, >>>> but >>>>>> rather extended the existing one. Similar to multi-objective >>>>> optimization, >>>>>> costs incurred at different time points are considered different >>>>>> dimensions. Tempura lets users supply a function that converts this >>>> cost >>>>>> vector into a final cost. So under this function, any two incremental >>>>> plans >>>>>> are still comparable and there is an overall optimum. I guess we can >>>> go >>>>>> down the route of multi-objective parametric query optimization >>>> instead >>>>> if >>>>>> there is a need. >>>>>> >>>>>> Next on materialized views and multi-query optimization, since our >>>>>> multi-time-point plan naturally involves materializing intermediate >>>>> results >>>>>> for later time points, we need to solve the problem of choosing >>>>>> materializations and include the cost of saving and reusing the >>>>>> materializations when costing and comparing plans. We borrowed the >>>>>> multi-query optimization techniques to solve this problem even though >>>> we >>>>>> are looking at a single query. As a result, we think our work is >>>>> orthogonal >>>>>> to Calcite's facilities around utilizing existing views, lattice etc. >>>> We >>>>> do >>>>>> feel that the multi-query optimization component can be adopted to >>>> wider >>>>>> use, but probably need more suggestions from the community. >>>>>> >>>>>> Lastly, our current implementation is set up in java code, it should >>>> be >>>>>> straightforward to hook it up with SQL shell. >>>>>> >>>>>> Thanks, >>>>>> Botong >>>>>> >>>>>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Botong, >>>>>>> >>>>>>> This is very exciting; congratulations on this research, and thank >>>> you >>>>>>> for contributing it back to Calcite. >>>>>>> >>>>>>> The research touches several areas in Calcite: streaming, >>>> materialized >>>>>>> view maintenance, and multi-query optimization. As we have already >>>> some >>>>>>> solutions in those areas (Sigma and Delta relational operators, >>>> lattice, >>>>>>> and Spool operator), it will be interesting to see whether we can >>>> make >>>>> them >>>>>>> compatible, or whether one concept can subsume others. >>>>>>> >>>>>>> Your work differs from streaming queries in that your relations are >>>> used >>>>>>> by “external” user queries, whereas in pure streaming queries, the >>>> only >>>>>>> activity is the change propagation. Did you find that you needed two >>>>>>> separate cost models - one for “view maintenance” and another for >>>> “user >>>>>>> queries” - since the objectives of each activity are so different? >>>>>>> >>>>>>> I wonder whether this work will hasten the arrival of multi-objective >>>>>>> parametric query optimization [1] in Calcite. >>>>>>> >>>>>>> I will make time over the next few days to read and digest your >>>> paper. >>>>>>> Then I expect that we will have a back-and-forth process to create >>>>>>> something that will be useful for the broader community. >>>>>>> >>>>>>> One thing will be particularly useful: making this functionality >>>>>>> available from a SQL shell, so that people can experiment with this >>>>>>> functionality without writing Java code or setting up complex >>>> databases >>>>> and >>>>>>> metadata. I have in mind something like the simple DDL operations >>>> that >>>>> are >>>>>>> available in Calcite’s ’server’ module. I wonder whether we could >>>> devise >>>>>>> some kind of SQL syntax for a “multi-query”. >>>>>>> >>>>>>> Julian >>>>>>> >>>>>>> [1] >>>>>>> >>>>> >>>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> >>>> wrote: >>>>>>>> >>>>>>>> Thanks Aron for pointing this out. To see the figure, please refer >>>> to >>>>>>> Fig >>>>>>>> 3(a) in our paper: >>>>>>> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf >>>>>>>> >>>>>>>> Best, >>>>>>>> Botong >>>>>>>> >>>>>>>> On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]> >>>>> wrote: >>>>>>>> >>>>>>>>> Seems interesting, the pic can not be seen in the mail, may you >>>> open >>>>> a >>>>>>> JIRA >>>>>>>>> for this, people who are interested in this can subscribe to the >>>>> JIRA? >>>>>>>>> >>>>>>>>> >>>>>>>>> Regards! >>>>>>>>> >>>>>>>>> Aron Tao >>>>>>>>> >>>>>>>>> >>>>>>>>> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> This is a proposal to extend the Calcite optimizer into a general >>>>>>>>>> incremental query optimizer, based on our research paper >>>> published >>>>> in >>>>>>>>> VLDB >>>>>>>>>> 2021: >>>>>>>>>> Tempura: a general cost-based optimizer framework for incremental >>>>> data >>>>>>>>>> processing >>>>>>>>>> >>>>>>>>>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s >>>> data >>>>>>>>>> warehouse is planning to use this incremental query optimizer to >>>>>>>>> alleviate >>>>>>>>>> cluster-wise resource skewness: >>>>>>>>>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental >>>>>>>>> Computing >>>>>>>>>> >>>>>>>>>> To our best knowledge, this is the first general cost-based >>>>>>> incremental >>>>>>>>>> optimizer that can find the best plan across multiple families of >>>>>>>>>> incremental computing methods, including IVM, Streaming, >>>> DBToaster, >>>>>>> etc. >>>>>>>>>> Experiments (in the paper) shows that the generated best plan is >>>>>>>>>> consistently much better than the plans from each individual >>>> method >>>>>>>>> alone. >>>>>>>>>> >>>>>>>>>> In general, incremental query planning is central to database >>>> view >>>>>>>>>> maintenance and stream processing systems, and are being adopted >>>> in >>>>>>>>> active >>>>>>>>>> databases, resumable query execution, approximate query >>>> processing, >>>>>>> etc. >>>>>>>>> We >>>>>>>>>> are hoping that this feature can help widening the spectrum of >>>>>>> Calcite, >>>>>>>>>> solicit more use cases and adoption of Calcite. >>>>>>>>>> >>>>>>>>>> Below is a brief description of the technical details. Please >>>> refer >>>>> to >>>>>>>>> the >>>>>>>>>> Tempura paper for more details. We are also working on a journal >>>>>>> version >>>>>>>>> of >>>>>>>>>> the paper with more implementation details. >>>>>>>>>> >>>>>>>>>> Currently the query plan generated by Calcite is meant to be >>>>> executed >>>>>>>>>> altogether at once. In the proposal, Calcite’s memo will be >>>> extended >>>>>>> with >>>>>>>>>> temporal information so that it is capable of generating >>>> incremental >>>>>>>>> plans >>>>>>>>>> that include multiple sub-plans to execute at different time >>>> points. >>>>>>>>>> >>>>>>>>>> The main idea is to view each table as one that changes over time >>>>>>> (Time >>>>>>>>>> Varying Relations (TVR)). To achieve that we introduced >>>> TvrMetaSet >>>>>>> into >>>>>>>>>> Calcite’s memo besides RelSet and RelSubset to track related >>>> RelSets >>>>>>> of a >>>>>>>>>> changing table (e.g. snapshot of the table at certain time, >>>> delta of >>>>>>> the >>>>>>>>>> table between two time points, etc.). >>>>>>>>>> >>>>>>>>>> [image: image.png] >>>>>>>>>> >>>>>>>>>> For example in the above figure, each vertical line is a >>>> TvrMetaSet >>>>>>>>>> representing a TVR (S, R, S left outer join R, etc.). Horizontal >>>>> lines >>>>>>>>>> represent time. Each black dot in the grid is a RelSet. Users can >>>>>>> write >>>>>>>>> TVR >>>>>>>>>> Rewrite Rules to describe valid transformations between these >>>> dots. >>>>>>> For >>>>>>>>>> example, the blues lines are inter-TVR rules that describe how to >>>>>>> compute >>>>>>>>>> certain RelSet of a TVR from RelSets of other TVRs. The red lines >>>>> are >>>>>>>>>> intra-TVR rules that describe transformations within a TVR. All >>>> TVR >>>>>>>>> rewrite >>>>>>>>>> rules are logical rules. All existing Calcite rules still work in >>>>> the >>>>>>> new >>>>>>>>>> volcano system without modification. >>>>>>>>>> >>>>>>>>>> All changes in this feature will consist of four parts: >>>>>>>>>> 1. Memo extension with TvrMetaSet >>>>>>>>>> 2. Rule engine upgrade, capable of matching TvrMetaSet and >>>> RelNodes, >>>>>>> as >>>>>>>>>> well as links in between the nodes. >>>>>>>>>> 3. A basic set of TvrRules, written using the upgraded rule >>>> engine >>>>>>> API. >>>>>>>>>> 4. Multi-query optimization, used to find the best incremental >>>> plan >>>>>>>>>> involving multiple time points. >>>>>>>>>> >>>>>>>>>> Note that this feature is an extension in nature and thus when >>>>>>> disabled, >>>>>>>>>> does not change any existing Calcite behavior. >>>>>>>>>> >>>>>>>>>> Other than scenarios in the paper, we also applied this >>>>>>> Calcite-extended >>>>>>>>>> incremental query optimizer to a type of periodic query called >>>> the >>>>>>>>> ‘‘range >>>>>>>>>> query’’ in Alibaba’s data warehouse. It achieved cost savings of >>>> 80% >>>>>>> on >>>>>>>>>> total CPU and memory consumption, and 60% on end-to-end execution >>>>>>> time. >>>>>>>>>> >>>>>>>>>> All comments and suggestions are welcome. Thanks and happy >>>> holidays! >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Botong >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>> >>>> >>>> >>>> -- >>>> ~~~~~~~~~~~~~~~ >>>> no mistakes >>>> ~~~~~~~~~~~~~~~~~~ >>>> >>>
