paul-rogers commented on issue #12262: URL: https://github.com/apache/druid/issues/12262#issuecomment-1184982176
> The monolithic native queries will not serve us well in a world where queries and dataflow can get more complex.

Just to provide a bit of context for others reading this, the "monolithic native queries" point is key. Today, Druid *is* a DAG of operations; those operations just tend to be rather large: let's call them "query blocks". We can name three: the native scan query block, the native group-by query block, and a third we'll call the "broker query block". Every Druid query is a DAG of at least two of these:

```text
broker query block
        |
 scan query block
```

Each block does a wide variety of tasks. The scan query block scans segments, pushing filters down through segment selection and into segment indexes and bitmaps. The group-by query block does this as well, but also performs first-stage aggregation. The broker query block is mostly about merging: it distributes work, merges results, handles missing segments, sorts, and applies offsets and limits. Since the broker block sits atop both the scan and group-by blocks, it must handle operations and data formats specific to each of those blocks. That's what the `QueryToolChest` does: it says, "hey, when you do a merge, do it such-and-so way."

The SQL-to-native-query translation step that @gianm mentioned maps the fine-grained Calcite relational operations into groups of operations assigned to native or broker query blocks, by way of defining one or more native queries. Since Druid has only scatter/gather, when a query has more than two blocks, the broker handles all but the leaf blocks. A distributed model would push blocks to other nodes. This is, roughly, what the MSQE controller does: it distributes native query blocks, wrapped in frame processors.

We could build a distributed query using just query blocks. Suppose we wanted to do eleven 10-way merges instead of a single 100-way merge.
We can use two levels of broker blocks:

```text
broker query block   (root, 1 of these)
        |
broker query block   (10 of these)
        |
 scan query block    (100 of these)
```

Query blocks do "just-in-time planning." The broker says, "hey, top-level query runner, execute this query." That query runner says, "well, I know how to limit, so I'll do that, and I'll ask the next query runner to run the rest of the query." The next one knows how to merge but not how to fetch data. The one after that can handle retries, but not distribution. The one below that knows how to distribute, and so on. This is a very clever and unusual approach: it attaches planning to each "operator" and avoids the need for a big Calcite-like planner. But it means we can make decisions based only on local information: we can't, say, choose to reorder a join tree based on metadata. If we need global-level planning, we need a planner/optimizer to do that work. (Though it would be cool if we could keep the local planning where it makes sense.)

So, one model going forward is to keep the existing native query blocks, plan based on those blocks, and distribute them. This is, in fact, what Druid does today. However, as @gianm observed, that "will not serve us well in a world where queries and dataflow can get more complex." The reason is that it becomes increasingly hard to map a Calcite DAG of relational operators onto an efficient set of native query blocks: it is a force-fit. The blocks do more than we need (adding overhead) and limit expressiveness. In short, we're trying to execute the classic DAG of relational operators with query blocks designed for a different use case (the original Druid native queries).

The discussion about operators points out that the native blocks (aside from segment scans) are, in fact, made up of operations implemented as query runners and sequences.
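To make the "just-in-time planning" idea concrete, here is a minimal sketch. The names (`QueryRunner`, `LimitRunner`, etc.) are hypothetical simplifications, not Druid's actual interfaces; each runner handles only the step it understands, then delegates the remainder of the query to the next runner in the chain:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, radically simplified "query": a limit plus raw per-segment data.
record Query(int limit, List<List<Integer>> segments) {}

interface QueryRunner {
    List<Integer> run(Query query);
}

// Knows how to apply a limit; delegates everything else downstream.
class LimitRunner implements QueryRunner {
    private final QueryRunner next;
    LimitRunner(QueryRunner next) { this.next = next; }
    public List<Integer> run(Query q) {
        return next.run(q).stream().limit(q.limit()).collect(Collectors.toList());
    }
}

// Knows how to merge (here: sort) per-segment results; delegates scanning.
class MergeRunner implements QueryRunner {
    private final QueryRunner next;
    MergeRunner(QueryRunner next) { this.next = next; }
    public List<Integer> run(Query q) {
        return next.run(q).stream().sorted().collect(Collectors.toList());
    }
}

// Leaf: "scans" the segments by flattening them.
class ScanRunner implements QueryRunner {
    public List<Integer> run(Query q) {
        return q.segments().stream().flatMap(List::stream).collect(Collectors.toList());
    }
}

public class JitPlanningSketch {
    public static void main(String[] args) {
        // Each runner "plans" its own step locally; no global planner exists.
        QueryRunner chain = new LimitRunner(new MergeRunner(new ScanRunner()));
        Query q = new Query(3, List.of(List.of(5, 1), List.of(4, 2)));
        System.out.println(chain.run(q)); // [1, 2, 4]
    }
}
```

Note how each runner sees only its own step: `LimitRunner` cannot know, for example, that pushing the limit below the merge might be cheaper. That is the local-information limitation described above.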
We simply disaggregate those blocks into their component operators, wrap that code in a composable form (operators), and then map more directly from the Calcite logical plan to a distributed physical plan made up of operators. We know this works well because dozens of DBs (including Oracle, SQL Server, Teradata, Spark, and other industry leaders) work this way.

The key question is: for the goals we want to hit, do we actually need the flexibility of the "standard" approach? Or can we achieve our goals, with less cost and risk, by adding a shuffle layer between native query blocks?
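As a rough sketch of the disaggregated model (again with hypothetical names, not Druid's actual operator API): each operator does exactly one thing and exposes the same iterator-style protocol, so a planner can compose them into any physical plan the logical plan calls for:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Every operator speaks the same protocol, so composition is unconstrained.
interface Operator extends Iterator<Integer> {}

// Leaf operator: yields rows from one "segment".
class Scan implements Operator {
    private final Iterator<Integer> rows;
    Scan(List<Integer> segment) { this.rows = segment.iterator(); }
    public boolean hasNext() { return rows.hasNext(); }
    public Integer next() { return rows.next(); }
}

// Filter operator: passes through only rows matching a predicate.
class Filter implements Operator {
    private final Operator child;
    private final Predicate<Integer> pred;
    private Integer pending;
    Filter(Operator child, Predicate<Integer> pred) { this.child = child; this.pred = pred; }
    public boolean hasNext() {
        while (pending == null && child.hasNext()) {
            Integer row = child.next();
            if (pred.test(row)) pending = row;
        }
        return pending != null;
    }
    public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        Integer row = pending;
        pending = null;
        return row;
    }
}

// Limit operator: stops after n rows.
class Limit implements Operator {
    private final Operator child;
    private int remaining;
    Limit(Operator child, int n) { this.child = child; this.remaining = n; }
    public boolean hasNext() { return remaining > 0 && child.hasNext(); }
    public Integer next() { remaining--; return child.next(); }
}

public class OperatorSketch {
    public static void main(String[] args) {
        // Physical plan assembled operator-by-operator: Limit(Filter(Scan)).
        Operator plan = new Limit(
            new Filter(new Scan(List.of(1, 2, 3, 4, 5, 6)), r -> r % 2 == 0), 2);
        while (plan.hasNext()) System.out.print(plan.next() + " "); // prints: 2 4
    }
}
```

The contrast with the query-block model is that nothing here bundles scanning, filtering, and limiting into one unit: the planner chooses the composition, which is what makes a direct mapping from the Calcite logical plan possible.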
