gianm commented on issue #12262: URL: https://github.com/apache/druid/issues/12262#issuecomment-1183728238
> The way we achieved these goals in other systems (Drill, Presto, Impala, Hive, etc.) is to use the classic optimizer + operator DAG architecture. The optimizer allows queries to handle a wide variety of use cases (anything that can be expressed in SQL): scans, groupings, joins, windowing, grouping over joins over scans topped by windowing, etc. The optimizer (AKA "planner") handles distribution: whether that be classic Druid scatter-gather or a more general multi-stage plan. The optimizer also handles tasks such as data locality, resource balancing, etc.

@paul-rogers imo, this is absolutely where we want to go. The monolithic native queries will not serve us well in a world where queries and dataflow can get more complex. It seems clear that the ideal structure is, as you say, the standard approach of a tree of fine-grained operators that can be manipulated by the planner and then distributed and run by the execution engine. The main question to me is how we get there.

It's helpful to zoom out and look at the query stack as a whole. Today's query stack has four main pieces.

**First:** SQL-to-native-query translation, which happens in a Calcite-based planner using a collection of Calcite rules that convert logical operators into native queries. They use a "collapsing" strategy, where a tree full of Calcite's fine-grained builtin logical operators is collapsed into monolithic native query nodes one by one. (One rule turns a Scan operator into a Druid query; the next rule collapses that Druid query node with a Filter node into a new Druid query node; etc.)

**Second:** Scatter on the Broker. This is managed mainly by CachingClusteredClient. It scatters the query by creating subqueries for each data server referencing the specific segments it wants those data servers to process.

**Third:** Segment fan-out on the data servers.
Each data server uses `QueryRunnerFactory.createRunner` to create a runner for each segment, and `QueryRunnerFactory.mergeRunners` to create a merged result stream from those runners.

**Fourth:** Gather on the Broker. This is also managed mainly by CachingClusteredClient. It collects sorted streams of results from each data server, then merges them into a single stream using `Query.getResultOrdering` (to do an n-way merge of sorted results) and `QueryToolChest.mergeResults` (to combine rows when appropriate, and do follow-on processing such as projections, having filters, post-aggregation ordering, and limit).

To me the key question is: what here can we keep as-is, what can be evolved, and what should be implemented fresh?

🔥 takes:

- We can't drop support for the current monolithic native queries, which means we need to either retain the execution system that runs them, or be able to convert them to operator trees that can run just as well. It seems to me that the latter (convert native queries to operators, then run the operators) is the right long-term play. But it also seems to me that we'll want to start by having two execution systems. These two systems should share as much code as feasible, so we aren't _literally_ maintaining two completely separate systems. (To enable this code sharing, we can use a strategy like the QueryRunner evolution plan that @paul-rogers has been talking about.)
- It seems to me that to do multi-stage low-latency queries, CachingClusteredClient will need to be replaced: its structure is too tightly coupled to the scatter/gather concept. I don't think its replacement will share much code with the current CachingClusteredClient, because CachingClusteredClient does everything on Sequences of Java objects, and we'll want the new stuff to work on frame channels.
- My guess is the QueryToolChest and QueryRunnerFactory interfaces also don't get carried through to the operator-tree execution system, but a lot of what they do internally does get carried through. Something like: we'll have QueryToolChest and QueryRunnerFactory for the current monolithic queries, some new thing for operator-based queries, and internally they will share a bunch of the same code. I'm not sure how frames fit into this. If they're going to be used for local processing (not just for IPC), then this suggests a broad move away from Sequences.
- We'll want the SQL planner to be able to generate operator trees rather than trees of monolithic native queries. It seems to me that the long-term play is to go to 100% operator trees, but we're likely to start by having the planner be able to generate either one.
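To make the **Fourth** step concrete: the gather on the Broker is at heart an n-way merge of individually-sorted per-server streams, keyed by the query's result ordering. Below is a minimal sketch of that merge in plain Java, using iterators over in-memory lists in place of Druid's Sequences; the `NWayMerge` class and its names are illustrative assumptions, not Druid's actual API, and it buffers the output where the real gather streams.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class NWayMerge {
    // N-way merge of individually-sorted streams into one sorted stream,
    // roughly what the Broker's gather step does with per-data-server results.
    public static <T> List<T> merge(List<Iterator<T>> streams, Comparator<T> ordering) {
        // Heap keyed on the current head element of each stream.
        PriorityQueue<Map.Entry<T, Iterator<T>>> heap =
            new PriorityQueue<>(Map.Entry.comparingByKey(ordering));
        for (Iterator<T> stream : streams) {
            if (stream.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(stream.next(), stream));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<T, Iterator<T>> head = heap.poll();
            out.add(head.getKey());  // emit the smallest head element
            Iterator<T> stream = head.getValue();
            if (stream.hasNext()) {  // refill the heap from the same stream
                heap.add(new AbstractMap.SimpleEntry<>(stream.next(), stream));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three "data servers", each returning a sorted result stream.
        List<Iterator<Integer>> perServer = List.of(
            List.of(1, 4, 7).iterator(),
            List.of(2, 5, 8).iterator(),
            List.of(3, 6, 9).iterator());
        System.out.println(merge(perServer, Comparator.naturalOrder()));
    }
}
```

The heap holds one entry per stream, so each emitted row costs O(log k) for k servers, which is why the gather step depends on the per-server streams already being sorted by `Query.getResultOrdering`.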
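And to make "a tree of fine-grained operators" concrete: here is a toy sketch of the shape such a tree might take. The `Operator` interface and the `scan`/`filter`/`limit` helpers are hypothetical names for illustration, not Druid's actual interfaces; the point is that each relational step is its own composable node that a planner can manipulate, rather than a field inside one monolithic query object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

public class OperatorTree {
    // A fine-grained operator yields a row stream when opened; parents pull from children.
    interface Operator { Iterator<Integer> open(); }

    // Leaf operator: scans an in-memory "segment" of rows.
    static Operator scan(List<Integer> segment) {
        return segment::iterator;
    }

    // Filter operator: wraps a child and drops rows failing the predicate.
    // (Buffers for brevity; a real operator would stream row by row.)
    static Operator filter(Operator child, Predicate<Integer> pred) {
        return () -> {
            List<Integer> kept = new ArrayList<>();
            child.open().forEachRemaining(row -> { if (pred.test(row)) kept.add(row); });
            return kept.iterator();
        };
    }

    // Limit operator: stops pulling from the child after n rows.
    static Operator limit(Operator child, int n) {
        return () -> {
            Iterator<Integer> in = child.open();
            List<Integer> kept = new ArrayList<>();
            while (in.hasNext() && kept.size() < n) {
                kept.add(in.next());
            }
            return kept.iterator();
        };
    }

    public static void main(String[] args) {
        // A planner would assemble (and rearrange) trees like this one:
        // Limit(2) <- Filter(even) <- Scan(segment)
        Operator plan = limit(filter(scan(List.of(1, 2, 3, 4, 5, 6)), row -> row % 2 == 0), 2);
        List<Integer> result = new ArrayList<>();
        plan.open().forEachRemaining(result::add);
        System.out.println(result);
    }
}
```

In this shape, "convert native to operators" means translating each part of a monolithic query (its filter, its limit, and so on) into one of these nodes, after which the planner is free to split the tree across stages or servers.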