Thank you for raising this issue; the thread is really interesting. Most of my use cases involve spatial data processing and visualization.
Spatial data processing usually involves executing complex Directed Acyclic Graphs (DAGs) of relatively expensive tasks that need to be optimized and executed in parallel. For instance, here is a bird's-eye view of what Apache Baremaps does with PostgreSQL to produce vector maps [1]:

1) Download the latest OpenStreetMap planet file and other relevant data; the OSM data model represents the world with nodes (i.e., points), ways (i.e., sequences of nodes representing lines or polygons), and relations (i.e., sequences of ways and nodes representing more complex geometries).
2) Create temporary tables holding all the nodes and the references between ways and nodes that are needed to build geometries.
3) From the planet file and the temporary tables, create tables of nodes, ways, and relations that contain actual geometries instead of references.
4) Create materialized views that contain cleaner records (merging the lines of the road network, merging contiguous polygons of the same type such as forests, etc.).
5) Create materialized views that simplify the geometries at different zoom levels (world, continent, country, city, street, etc.).
6) Query the tables and materialized views to generate a large cache of preprocessed files that can be visualized in the browser.

Executing such a DAG for the whole planet requires a decent amount of storage and compute power. As the process of designing maps is relatively empirical, one cannot afford to re-execute the whole DAG to test small changes (e.g., tweaking the simplifications applied to the road network), so intermediate results need to be preserved. Another significant difficulty arises when the OpenStreetMap data gets updated: one approach is to run the whole DAG again; a better approach would be to selectively update the tables, materialized views, and cache with the delta.

The syntax we currently use to represent our DAG is limited and not satisfactory [2]; for instance, the inputs and outputs of the tasks could be used to define the edges. At some point, I wondered whether all of this could be replaced with incremental view maintenance (IVM): to some extent, one could create a DAG where each vertex is a table or materialized view, and inserting new data on one end would update the data on the other end. Overall, it really feels like a multi-query optimization problem.
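To make steps 4 and 5 concrete, here is a minimal sketch of the kind of SQL involved; the table and column names (osm_polygon, landuse, forest) are illustrative, not the actual Baremaps schema:

    -- Step 4 (illustrative): merge contiguous polygons of the same type,
    -- then split the resulting multi-polygon back into simple polygons.
    CREATE MATERIALIZED VIEW forest AS
    SELECT (ST_Dump(ST_Union(geom))).geom AS geom
    FROM osm_polygon
    WHERE landuse = 'forest';

    -- Step 5 (illustrative): simplify the merged geometries for a coarser
    -- zoom level; the tolerance grows as you zoom out.
    CREATE MATERIALIZED VIEW forest_z10 AS
    SELECT ST_SimplifyPreserveTopology(geom, 100) AS geom
    FROM forest;

    -- When the OSM data changes, the only built-in option today is a full
    -- refresh of each view, in dependency order:
    REFRESH MATERIALIZED VIEW forest;
    REFRESH MATERIALIZED VIEW forest_z10;

With IVM, inserting the OSM delta into osm_polygon would instead propagate incrementally through forest and forest_z10, which is exactly the selective update described above.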
What I just described probably feels like a wobbly assembly ;) and indeed I started contributing to Apache Calcite while looking for better solutions. I'd be really happy to help!

[1] https://demo.baremaps.com/#14/46.58954/6.55072
[2] https://github.com/apache/incubator-baremaps/blob/main/basemap/import.js

> On 4 Jan 2024, at 21:52, Andrii Tsvielodub <[email protected]> wrote:
>
> This would be an exciting extension. I have experience using Calcite as a backend for a visualization platform, and we had to develop similar functionality on top of it, as it was essential for many optimizations we had in mind.
>
> Here's a quick outline of our use cases:
> - Re-use of intermediate results. Two flavors here:
> -- A common part of a request is extracted, and the data is cached in the scope of one query. This could probably have been implemented using a spool, but we did it using a custom optimization program.
> -- One query is pushed down into the DB, and then another query is run on top of that result and produces the final result set. This was probably the most frequently used optimization. The first query did partial aggregation in the DB, which was very beneficial for performance.
> - Using the results of one query in another query. There have been numerous posts on the mailing list about building a dynamic filter for pushdown. We did this by simply running several queries sequentially. For us, the benefit of offloading both queries to the DB outweighed the drawback of possibly scanning the same data twice.
> - Re-using intermediate results (again), but returning several result sets: doing several levels of aggregation on top of raw data (or on the smallest aggregation level).
> - Running several independent queries on top of the same dataset. This is probably not much of an optimization from the DB standpoint (if we ignore all the smart multi-query optimizations, which I think not many DBs actually have today). However, it brought some minor communication-level improvements for us as the data layer of the product.
>
> This was all implemented on top of Calcite, by building rel trees of varying complexity. Some of that logic was abstracted into our "Statement" object; other parts were just pieces of logic during rel tree building. But I think most of those optimizations could be implemented in Calcite, if there are enough extension points to plug in custom logic.
>
> There's probably nothing new here, but I hope this confirms such functionality's usefulness.
>
> Would be happy to help with the design/implementation.
>
> On Thu, Jan 4, 2024 at 10:47 PM Julian Hyde <[email protected]> wrote:
>
>> I didn't actually have an algorithm in mind. I wanted to kick off a research project, by saying 'This is a common problem, and here is a language for describing it. Does anyone have ideas for how to solve it?'.
>>
>> The issue keeps coming up, and I want to be able to say 'Yes, we've been thinking about that in Calcite'.
>>
>> So, I'll go ahead and log that Jira case. I'd welcome improvements — better SQL syntax, a better set of use cases — from people who have been working in this space.
>>
>> I hadn't thought about field trimming across trees, but that sounds very important and useful.
>>
>> I suspect that the Volcano approach will get us 80% of the way there. Dynamic programming doesn't really care whether we are working on a tree, a forest, or a collection of trees loosely connected into a DAG. The only problem — as noted in https://issues.apache.org/jira/browse/CALCITE-481 — is that Volcano's cost model will tend to double-count.
>>
>> But my goal isn't to solve the problem. Once we have a good set of use cases (pure query, pure DML, hybrid query+DML, cyclic data flows) expressed in SQL, we can start discussing algorithms.
>>
>>> On Jan 3, 2024, at 9:25 PM, Benchao Li <[email protected]> wrote:
>>>
>>> This is a very fundamental use case in Flink. In Flink, we have a mechanism that breaks the DAG into sub-trees [1] and then uses Calcite to optimize each sub-tree. This forgoes many optimization opportunities across sub-trees, such as field trimming, filter pushdown, and many more transposing rules that exploit cost-based optimization.
>>>
>>> The benefit we get that way is that we avoid many complex optimization corner cases, since sub-trees are simpler than the whole DAG. We even exploited this informal property to temporarily work around some planner corner cases (could not produce a valid plan / optimization dead loops) before we finally solved the root cause.
>>>
>>> Solving this problem in Calcite sounds good to me; my only concern is the optimization complexity (I'm not sure how you plan to do it yet, this is just a heads-up from my experience). I know it's more related to how users choose the rule sets and to the complexity of queries, but it may be one of the factors that affect the usability of multi-query optimization.
>>>
>>> [1] https://github.com/apache/flink/blob/8e8e2d649e98071bc15ee768cf65da3e96f255b4/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/optimize/CommonSubGraphBasedOptimizer.scala#L58
>>>
>>> Forward Xu <[email protected]> wrote on Thu, 4 Jan 2024 at 10:51:
>>>>
>>>> Thanks to the community for the related work that has been done. This is a good discussion, and I think it can be applied to, for example, better multi-query optimization algorithms, more effective query-result caching strategies, smarter materialized-view selection and maintenance methods, etc. These directions require further research and exploration.
>>>>
>>>> Finally, here is a related paper, "Multi-Query Optimization and Applications" <https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2015/2014/Papers/prasan-thesis.pdf>.
>>>>
>>>> Best,
>>>> ForwardXu
>>>>
>>>> Julian Hyde <[email protected]> wrote on Thu, 4 Jan 2024 at 04:09:
>>>>
>>>>> Multi-query optimization is one of the big challenges of our field. Examples of multi-queries:
>>>>> * an INSERT statement that writes into a table but also updates an index;
>>>>> * a DAG that represents an ETL/ELT job;
>>>>> * a query that produces several data sets (say, a list of invoices for orders and a list of products that need to be restocked);
>>>>> * a query that uses intermediate results more than once.
>>>>>
>>>>> Common features of multi-queries are multiple output data sets and re-used intermediate results. In other words, the dataflow graph is a DAG rather than just a tree.
>>>>>
>>>>> I think it would be useful to frame the problem by extending SQL so that such multi-queries can be represented as a single unit. From that would follow extensions to relational algebra, and improvements to planner algorithms and cost models.
>>>>>
>>>>> I would like to hear people's thoughts before I log a Jira case with a sketch of the problem.
>>>>>
>>>>> Related work:
>>>>> * https://issues.apache.org/jira/browse/CALCITE-481 Spool operator
>>>>> * https://issues.apache.org/jira/browse/CALCITE-1440 Multiple RelNodes
>>>>> * https://issues.apache.org/jira/browse/CALCITE-4568 Incremental query optimization
>>>>> * https://issues.apache.org/jira/browse/CALCITE-129 Recursive queries
>>>>>
>>>>> Julian
>>>
>>> --
>>>
>>> Best,
>>> Benchao Li
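P.S. To make the "better SQL syntax" question a bit more concrete for my use case: one could imagine a single multi-query statement in which named intermediate results define the edges of the DAG and several outputs share them. The sketch below is pure invention (no such OUTPUT syntax exists in Calcite or PostgreSQL; the names reuse the illustrative schema from my example above); it only illustrates the shape of the problem, one shared materialized intermediate feeding two result sets:

    -- Hypothetical multi-query syntax: the shared CTE is computed once,
    -- and each OUTPUT clause produces its own result set from it.
    WITH forest AS MATERIALIZED (
      SELECT (ST_Dump(ST_Union(geom))).geom AS geom
      FROM osm_polygon
      WHERE landuse = 'forest')
    OUTPUT forest_z10 AS
      (SELECT ST_SimplifyPreserveTopology(geom, 100) FROM forest)
    OUTPUT forest_z14 AS
      (SELECT ST_SimplifyPreserveTopology(geom, 10) FROM forest);

Because forest feeds both outputs, the dataflow is a DAG rather than a tree, which matches the framing above; the optimizer would then be free to decide whether to spool the intermediate or recompute it.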