For me to participate in the discussion for the above questions, I will need to read a lot more to know relevant context and likely ask lots of questions :-). A editable doc is probably good for questions and back and forward discussion.
-Rui On Thu, Jan 28, 2021 at 10:50 AM Rui Wang <[email protected]> wrote: > I am also happy to help push this work into Calcite (review code and doc, > etc.). > > While you can share your code so people can have more idea how it is > implemented, I think it would be also nice to have a doc to discuss open > questions above. Some points that I copy those to here: > > 1. Can this solution be compatible with existing solutions in Calcite > Streaming, materialized view maintenance, and multi-query optimization > (Sigma and Delta relational operators, lattice, and Spool operator), > 2. Did you find that you needed two separate cost models - one for “view > maintenance” and another for “user queries” - since the objectives of each > activity are so different? > 3. whether this work will hasten the arrival of multi-objective parametric > query optimization [1] in Calcite. > 4. probably SQL shell support. > > > [1]: > https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext > > > -Rui > > > > On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote: > >> it would be very nice to see a POC of your work. >> >> >> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> wrote: >> >> > Hi Julian, >> > >> > Just wondering if there are any updates? We are wondering if it would >> help >> > to post our code for a quick preview. >> > >> > Thanks, >> > Botong >> > >> > On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> wrote: >> > >> > > Hi Julian, >> > > >> > > Thanks for your interest! Sure let's figure out a plan that best >> benefits >> > > the community. Here are some clarifications that hopefully answer your >> > > questions. >> > > >> > > In our work (Tempura), users specify the set of time points to >> consider >> > > running and a cost function that expresses users' preference over >> time, >> > > Tempura will generate the best incremental plan that minimizes the >> > overall >> > > cost function. >> > > >> > > In this incremental plan, the sub-plans at different time points can >> be >> > > different from each other, as opposed to identical plans in all delta >> > runs >> > > as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we >> can >> > > mimic the current streaming implementation by specifying two (logical) >> > time >> > > points in Tempura, representing the initial run and later delta runs >> > > respectively. In general, note that Tempura supports various form of >> > > incremental computing, not only the small-delta append-only data >> model in >> > > streaming systems. That's why we believe Tempura subsumes the current >> > > streaming support, as well as any IVM implementations. >> > > >> > > About the cost model, we did not come up with a seperate cost model, >> but >> > > rather extended the existing one. Similar to multi-objective >> > optimization, >> > > costs incurred at different time points are considered different >> > > dimensions. Tempura lets users supply a function that converts this >> cost >> > > vector into a final cost. So under this function, any two incremental >> > plans >> > > are still comparable and there is an overall optimum. I guess we can >> go >> > > down the route of multi-objective parametric query optimization >> instead >> > if >> > > there is a need. >> > > >> > > Next on materialized views and multi-query optimization, since our >> > > multi-time-point plan naturally involves materializing intermediate >> > results >> > > for later time points, we need to solve the problem of choosing >> > > materializations and include the cost of saving and reusing the >> > > materializations when costing and comparing plans. We borrowed the >> > > multi-query optimization techniques to solve this problem even though >> we >> > > are looking at a single query. As a result, we think our work is >> > orthogonal >> > > to Calcite's facilities around utilizing existing views, lattice etc. >> We >> > do >> > > feel that the multi-query optimization component can be adopted to >> wider >> > > use, but probably need more suggestions from the community. >> > > >> > > Lastly, our current implementation is set up in java code, it should >> be >> > > straightforward to hook it up with SQL shell. >> > > >> > > Thanks, >> > > Botong >> > > >> > > On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]> >> > > wrote: >> > > >> > >> Botong, >> > >> >> > >> This is very exciting; congratulations on this research, and thank >> you >> > >> for contributing it back to Calcite. >> > >> >> > >> The research touches several areas in Calcite: streaming, >> materialized >> > >> view maintenance, and multi-query optimization. As we have already >> some >> > >> solutions in those areas (Sigma and Delta relational operators, >> lattice, >> > >> and Spool operator), it will be interesting to see whether we can >> make >> > them >> > >> compatible, or whether one concept can subsume others. >> > >> >> > >> Your work differs from streaming queries in that your relations are >> used >> > >> by “external” user queries, whereas in pure streaming queries, the >> only >> > >> activity is the change propagation. Did you find that you needed two >> > >> separate cost models - one for “view maintenance” and another for >> “user >> > >> queries” - since the objectives of each activity are so different? >> > >> >> > >> I wonder whether this work will hasten the arrival of multi-objective >> > >> parametric query optimization [1] in Calcite. >> > >> >> > >> I will make time over the next few days to read and digest your >> paper. >> > >> Then I expect that we will have a back-and-forth process to create >> > >> something that will be useful for the broader community. >> > >> >> > >> One thing will be particularly useful: making this functionality >> > >> available from a SQL shell, so that people can experiment with this >> > >> functionality without writing Java code or setting up complex >> databases >> > and >> > >> metadata. I have in mind something like the simple DDL operations >> that >> > are >> > >> available in Calcite’s ’server’ module. I wonder whether we could >> devise >> > >> some kind of SQL syntax for a “multi-query”. >> > >> >> > >> Julian >> > >> >> > >> [1] >> > >> >> > >> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext >> > >> >> > >> >> > >> >> > >> > On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> >> wrote: >> > >> > >> > >> > Thanks Aron for pointing this out. To see the figure, please refer >> to >> > >> Fig >> > >> > 3(a) in our paper: >> > >> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf >> > >> > >> > >> > Best, >> > >> > Botong >> > >> > >> > >> > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]> >> > wrote: >> > >> > >> > >> >> Seems interesting, the pic can not be seen in the mail, may you >> open >> > a >> > >> JIRA >> > >> >> for this, people who are interested in this can subscribe to the >> > JIRA? >> > >> >> >> > >> >> >> > >> >> Regards! >> > >> >> >> > >> >> Aron Tao >> > >> >> >> > >> >> >> > >> >> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道: >> > >> >> >> > >> >>> Hi all, >> > >> >>> >> > >> >>> This is a proposal to extend the Calcite optimizer into a general >> > >> >>> incremental query optimizer, based on our research paper >> published >> > in >> > >> >> VLDB >> > >> >>> 2021: >> > >> >>> Tempura: a general cost-based optimizer framework for incremental >> > data >> > >> >>> processing >> > >> >>> >> > >> >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s >> data >> > >> >>> warehouse is planning to use this incremental query optimizer to >> > >> >> alleviate >> > >> >>> cluster-wise resource skewness: >> > >> >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental >> > >> >> Computing >> > >> >>> >> > >> >>> To our best knowledge, this is the first general cost-based >> > >> incremental >> > >> >>> optimizer that can find the best plan across multiple families of >> > >> >>> incremental computing methods, including IVM, Streaming, >> DBToaster, >> > >> etc. >> > >> >>> Experiments (in the paper) shows that the generated best plan is >> > >> >>> consistently much better than the plans from each individual >> method >> > >> >> alone. >> > >> >>> >> > >> >>> In general, incremental query planning is central to database >> view >> > >> >>> maintenance and stream processing systems, and are being adopted >> in >> > >> >> active >> > >> >>> databases, resumable query execution, approximate query >> processing, >> > >> etc. >> > >> >> We >> > >> >>> are hoping that this feature can help widening the spectrum of >> > >> Calcite, >> > >> >>> solicit more use cases and adoption of Calcite. >> > >> >>> >> > >> >>> Below is a brief description of the technical details. Please >> refer >> > to >> > >> >> the >> > >> >>> Tempura paper for more details. We are also working on a journal >> > >> version >> > >> >> of >> > >> >>> the paper with more implementation details. >> > >> >>> >> > >> >>> Currently the query plan generated by Calcite is meant to be >> > executed >> > >> >>> altogether at once. In the proposal, Calcite’s memo will be >> extended >> > >> with >> > >> >>> temporal information so that it is capable of generating >> incremental >> > >> >> plans >> > >> >>> that include multiple sub-plans to execute at different time >> points. >> > >> >>> >> > >> >>> The main idea is to view each table as one that changes over time >> > >> (Time >> > >> >>> Varying Relations (TVR)). To achieve that we introduced >> TvrMetaSet >> > >> into >> > >> >>> Calcite’s memo besides RelSet and RelSubset to track related >> RelSets >> > >> of a >> > >> >>> changing table (e.g. snapshot of the table at certain time, >> delta of >> > >> the >> > >> >>> table between two time points, etc.). >> > >> >>> >> > >> >>> [image: image.png] >> > >> >>> >> > >> >>> For example in the above figure, each vertical line is a >> TvrMetaSet >> > >> >>> representing a TVR (S, R, S left outer join R, etc.). Horizontal >> > lines >> > >> >>> represent time. Each black dot in the grid is a RelSet. Users can >> > >> write >> > >> >> TVR >> > >> >>> Rewrite Rules to describe valid transformations between these >> dots. >> > >> For >> > >> >>> example, the blues lines are inter-TVR rules that describe how to >> > >> compute >> > >> >>> certain RelSet of a TVR from RelSets of other TVRs. The red lines >> > are >> > >> >>> intra-TVR rules that describe transformations within a TVR. All >> TVR >> > >> >> rewrite >> > >> >>> rules are logical rules. All existing Calcite rules still work in >> > the >> > >> new >> > >> >>> volcano system without modification. >> > >> >>> >> > >> >>> All changes in this feature will consist of four parts: >> > >> >>> 1. Memo extension with TvrMetaSet >> > >> >>> 2. Rule engine upgrade, capable of matching TvrMetaSet and >> RelNodes, >> > >> as >> > >> >>> well as links in between the nodes. >> > >> >>> 3. A basic set of TvrRules, written using the upgraded rule >> engine >> > >> API. >> > >> >>> 4. Multi-query optimization, used to find the best incremental >> plan >> > >> >>> involving multiple time points. >> > >> >>> >> > >> >>> Note that this feature is an extension in nature and thus when >> > >> disabled, >> > >> >>> does not change any existing Calcite behavior. >> > >> >>> >> > >> >>> Other than scenarios in the paper, we also applied this >> > >> Calcite-extended >> > >> >>> incremental query optimizer to a type of periodic query called >> the >> > >> >> ‘‘range >> > >> >>> query’’ in Alibaba’s data warehouse. It achieved cost savings of >> 80% >> > >> on >> > >> >>> total CPU and memory consumption, and 60% on end-to-end execution >> > >> time. >> > >> >>> >> > >> >>> All comments and suggestions are welcome. Thanks and happy >> holidays! >> > >> >>> >> > >> >>> Best, >> > >> >>> Botong >> > >> >>> >> > >> >> >> > >> >> > >> >> > >> >> >> -- >> ~~~~~~~~~~~~~~~ >> no mistakes >> ~~~~~~~~~~~~~~~~~~ >> >
