I am also happy to help push this work into Calcite (review code and doc, etc.).
While you can share your code so people can have more idea how it is implemented, I think it would be also nice to have a doc to discuss open questions above. Some points that I copy those to here: 1. Can this solution be compatible with existing solutions in Calcite Streaming, materialized view maintenance, and multi-query optimization (Sigma and Delta relational operators, lattice, and Spool operator), 2. Did you find that you needed two separate cost models - one for “view maintenance” and another for “user queries” - since the objectives of each activity are so different? 3. whether this work will hasten the arrival of multi-objective parametric query optimization [1] in Calcite. 4. probably SQL shell support. [1]: https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext -Rui On Wed, Jan 27, 2021 at 6:52 PM Albert <[email protected]> wrote: > it would be very nice to see a POC of your work. > > > On Thu, Jan 28, 2021 at 10:21 AM Botong Huang <[email protected]> wrote: > > > Hi Julian, > > > > Just wondering if there are any updates? We are wondering if it would > help > > to post our code for a quick preview. > > > > Thanks, > > Botong > > > > On Fri, Jan 1, 2021 at 11:04 AM Botong Huang <[email protected]> wrote: > > > > > Hi Julian, > > > > > > Thanks for your interest! Sure let's figure out a plan that best > benefits > > > the community. Here are some clarifications that hopefully answer your > > > questions. > > > > > > In our work (Tempura), users specify the set of time points to consider > > > running and a cost function that expresses users' preference over time, > > > Tempura will generate the best incremental plan that minimizes the > > overall > > > cost function. > > > > > > In this incremental plan, the sub-plans at different time points can be > > > different from each other, as opposed to identical plans in all delta > > runs > > > as in streaming or IVM. As mentioned in $2.1 of the Tempura paper, we > can > > > mimic the current streaming implementation by specifying two (logical) > > time > > > points in Tempura, representing the initial run and later delta runs > > > respectively. In general, note that Tempura supports various form of > > > incremental computing, not only the small-delta append-only data model > in > > > streaming systems. That's why we believe Tempura subsumes the current > > > streaming support, as well as any IVM implementations. > > > > > > About the cost model, we did not come up with a seperate cost model, > but > > > rather extended the existing one. Similar to multi-objective > > optimization, > > > costs incurred at different time points are considered different > > > dimensions. Tempura lets users supply a function that converts this > cost > > > vector into a final cost. So under this function, any two incremental > > plans > > > are still comparable and there is an overall optimum. I guess we can go > > > down the route of multi-objective parametric query optimization instead > > if > > > there is a need. > > > > > > Next on materialized views and multi-query optimization, since our > > > multi-time-point plan naturally involves materializing intermediate > > results > > > for later time points, we need to solve the problem of choosing > > > materializations and include the cost of saving and reusing the > > > materializations when costing and comparing plans. We borrowed the > > > multi-query optimization techniques to solve this problem even though > we > > > are looking at a single query. As a result, we think our work is > > orthogonal > > > to Calcite's facilities around utilizing existing views, lattice etc. > We > > do > > > feel that the multi-query optimization component can be adopted to > wider > > > use, but probably need more suggestions from the community. > > > > > > Lastly, our current implementation is set up in java code, it should be > > > straightforward to hook it up with SQL shell. > > > > > > Thanks, > > > Botong > > > > > > On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <[email protected]> > > > wrote: > > > > > >> Botong, > > >> > > >> This is very exciting; congratulations on this research, and thank you > > >> for contributing it back to Calcite. > > >> > > >> The research touches several areas in Calcite: streaming, materialized > > >> view maintenance, and multi-query optimization. As we have already > some > > >> solutions in those areas (Sigma and Delta relational operators, > lattice, > > >> and Spool operator), it will be interesting to see whether we can make > > them > > >> compatible, or whether one concept can subsume others. > > >> > > >> Your work differs from streaming queries in that your relations are > used > > >> by “external” user queries, whereas in pure streaming queries, the > only > > >> activity is the change propagation. Did you find that you needed two > > >> separate cost models - one for “view maintenance” and another for > “user > > >> queries” - since the objectives of each activity are so different? > > >> > > >> I wonder whether this work will hasten the arrival of multi-objective > > >> parametric query optimization [1] in Calcite. > > >> > > >> I will make time over the next few days to read and digest your paper. > > >> Then I expect that we will have a back-and-forth process to create > > >> something that will be useful for the broader community. > > >> > > >> One thing will be particularly useful: making this functionality > > >> available from a SQL shell, so that people can experiment with this > > >> functionality without writing Java code or setting up complex > databases > > and > > >> metadata. I have in mind something like the simple DDL operations that > > are > > >> available in Calcite’s ’server’ module. I wonder whether we could > devise > > >> some kind of SQL syntax for a “multi-query”. > > >> > > >> Julian > > >> > > >> [1] > > >> > > > https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext > > >> > > >> > > >> > > >> > On Dec 23, 2020, at 8:55 PM, Botong Huang <[email protected]> wrote: > > >> > > > >> > Thanks Aron for pointing this out. To see the figure, please refer > to > > >> Fig > > >> > 3(a) in our paper: > > >> https://kai-zeng.github.io/papers/tempura-vldb2021.pdf > > >> > > > >> > Best, > > >> > Botong > > >> > > > >> > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <[email protected]> > > wrote: > > >> > > > >> >> Seems interesting, the pic can not be seen in the mail, may you > open > > a > > >> JIRA > > >> >> for this, people who are interested in this can subscribe to the > > JIRA? > > >> >> > > >> >> > > >> >> Regards! > > >> >> > > >> >> Aron Tao > > >> >> > > >> >> > > >> >> Botong Huang <[email protected]> 于2020年12月24日周四 上午3:18写道: > > >> >> > > >> >>> Hi all, > > >> >>> > > >> >>> This is a proposal to extend the Calcite optimizer into a general > > >> >>> incremental query optimizer, based on our research paper published > > in > > >> >> VLDB > > >> >>> 2021: > > >> >>> Tempura: a general cost-based optimizer framework for incremental > > data > > >> >>> processing > > >> >>> > > >> >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s data > > >> >>> warehouse is planning to use this incremental query optimizer to > > >> >> alleviate > > >> >>> cluster-wise resource skewness: > > >> >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental > > >> >> Computing > > >> >>> > > >> >>> To our best knowledge, this is the first general cost-based > > >> incremental > > >> >>> optimizer that can find the best plan across multiple families of > > >> >>> incremental computing methods, including IVM, Streaming, > DBToaster, > > >> etc. > > >> >>> Experiments (in the paper) shows that the generated best plan is > > >> >>> consistently much better than the plans from each individual > method > > >> >> alone. > > >> >>> > > >> >>> In general, incremental query planning is central to database view > > >> >>> maintenance and stream processing systems, and are being adopted > in > > >> >> active > > >> >>> databases, resumable query execution, approximate query > processing, > > >> etc. > > >> >> We > > >> >>> are hoping that this feature can help widening the spectrum of > > >> Calcite, > > >> >>> solicit more use cases and adoption of Calcite. > > >> >>> > > >> >>> Below is a brief description of the technical details. Please > refer > > to > > >> >> the > > >> >>> Tempura paper for more details. We are also working on a journal > > >> version > > >> >> of > > >> >>> the paper with more implementation details. > > >> >>> > > >> >>> Currently the query plan generated by Calcite is meant to be > > executed > > >> >>> altogether at once. In the proposal, Calcite’s memo will be > extended > > >> with > > >> >>> temporal information so that it is capable of generating > incremental > > >> >> plans > > >> >>> that include multiple sub-plans to execute at different time > points. > > >> >>> > > >> >>> The main idea is to view each table as one that changes over time > > >> (Time > > >> >>> Varying Relations (TVR)). To achieve that we introduced TvrMetaSet > > >> into > > >> >>> Calcite’s memo besides RelSet and RelSubset to track related > RelSets > > >> of a > > >> >>> changing table (e.g. snapshot of the table at certain time, delta > of > > >> the > > >> >>> table between two time points, etc.). > > >> >>> > > >> >>> [image: image.png] > > >> >>> > > >> >>> For example in the above figure, each vertical line is a > TvrMetaSet > > >> >>> representing a TVR (S, R, S left outer join R, etc.). Horizontal > > lines > > >> >>> represent time. Each black dot in the grid is a RelSet. Users can > > >> write > > >> >> TVR > > >> >>> Rewrite Rules to describe valid transformations between these > dots. > > >> For > > >> >>> example, the blues lines are inter-TVR rules that describe how to > > >> compute > > >> >>> certain RelSet of a TVR from RelSets of other TVRs. The red lines > > are > > >> >>> intra-TVR rules that describe transformations within a TVR. All > TVR > > >> >> rewrite > > >> >>> rules are logical rules. All existing Calcite rules still work in > > the > > >> new > > >> >>> volcano system without modification. > > >> >>> > > >> >>> All changes in this feature will consist of four parts: > > >> >>> 1. Memo extension with TvrMetaSet > > >> >>> 2. Rule engine upgrade, capable of matching TvrMetaSet and > RelNodes, > > >> as > > >> >>> well as links in between the nodes. > > >> >>> 3. A basic set of TvrRules, written using the upgraded rule engine > > >> API. > > >> >>> 4. Multi-query optimization, used to find the best incremental > plan > > >> >>> involving multiple time points. > > >> >>> > > >> >>> Note that this feature is an extension in nature and thus when > > >> disabled, > > >> >>> does not change any existing Calcite behavior. > > >> >>> > > >> >>> Other than scenarios in the paper, we also applied this > > >> Calcite-extended > > >> >>> incremental query optimizer to a type of periodic query called the > > >> >> ‘‘range > > >> >>> query’’ in Alibaba’s data warehouse. It achieved cost savings of > 80% > > >> on > > >> >>> total CPU and memory consumption, and 60% on end-to-end execution > > >> time. > > >> >>> > > >> >>> All comments and suggestions are welcome. Thanks and happy > holidays! > > >> >>> > > >> >>> Best, > > >> >>> Botong > > >> >>> > > >> >> > > >> > > >> > > > > > -- > ~~~~~~~~~~~~~~~ > no mistakes > ~~~~~~~~~~~~~~~~~~ >
